Preface



This volume contains all the papers presented at SSPR2000 and SPR2000, held 
at the University of Alicante, Spain, August 30 - September 1, 2000. 

This was the second time these two technical workshops were held
back-to-back. SSPR2000 was the eighth meeting of the international workshop on
Structural and Syntactic Pattern Recognition and SPR2000 was the third
international workshop on Statistical Techniques in Pattern Recognition.

These workshops have been traditionally sponsored by two of the most
representative technical committees of the International Association of Pattern
Recognition (IAPR): TC2 and TC1, respectively.

A total of 130 papers was received for consideration from almost 40 different
countries. Both the submission and the reviewing process were carried out
separately for each workshop, even though papers were distributed among
reviewers in both program committees according to their previously expressed
interests.

As a result of this “joint” reviewing process, papers were distributed into 
three categories: SSPR, SPR, and SSSPR. We designed the technical program 
of the joint workshop according to the accepted papers and five oral sessions 
were allocated for each of the two first categories and two oral sessions (eight 
papers) for the third one. 

The two poster sessions (one per workshop) were held at the same time and
were allocated a large amount of space and time to encourage discussion and
interaction among researchers. A total of 52 papers was selected for oral
presentation and 35 papers were presented in the two poster sessions. In
addition, we invited five distinguished speakers: Jim Bezdek from the
University of West Florida, USA; Marco Gori from the Università di Siena,
Italy; Colin de la Higuera from the Université Jean Monnet, Saint Etienne,
France; Sarunas Raudys from the Institute of Mathematics and Informatics,
Vilnius, Lithuania; and Josef Kittler from the University of Surrey, UK.

SSPR 2000 and SPR 2000 were sponsored by the Conselleria d’Educacio de
la Generalitat Valenciana under grant RGOO-01-22, the Departament de
Llenguatges i Sistemes Informatics (DLSI) and the Escuela Politecnica Superior
of the Universitat d’Alacant, the Departament d’Informatica (DI) of the
Universitat de Valencia, and the International Association of Pattern
Recognition (IAPR).

We would like to express our gratitude to all our sponsors and, especially,
to the members of the two program committees, who faced a really tough task
which has led to a selection of papers of a very high quality.

Special thanks are due to Ricardo Ferris, Esther de Ves, and Elena Diaz of 
the DI, Universitat de Valencia, and Francisco Moreno-Seco, Jorge Calera-Rubio, 
and Luisa Mico of the DLSI, Universitat d’Alacant, for their unbeatable effort in 
the organization of the workshops and in the preparation of the proceedings. We 
appreciate the help and understanding of the editorial staff of Springer-Verlag, in
particular Alfred Hofmann, who supported the publication of these proceedings
in their LNCS series.

Finally, we would like to mention two colleagues who passed away during the 
preparation of these workshops: Pierre Devijver, after whom the last of the 5 
plenary papers in the workshop is named, and Edzard Gelsema who died whilst 
holding the office of President of the IAPR.
Some Notes on Twenty One (21) Nearest Prototype Classifiers



James C. Bezdek^1 and Ludmila I. Kuncheva^2

^1 Computer Science Department, University of West Florida,
Pensacola, FL, 32514, USA
jbezdek@uwf.edu

^2 School of Informatics, University of Wales
LL57 1UT Bangor, UK
mas00a@bangor.ac.uk



Abstract. Comparisons made in two studies of 21 methods for finding
prototypes upon which to base the nearest prototype classifier are discussed.
The criteria used to compare the methods are whether they: (i) select or
extract point prototypes; (ii) employ pre- or post-supervision; and (iii) specify
the number of prototypes a priori, or obtain this number "automatically".
Numerical experiments with 5 data sets suggest that pre-supervised, extraction
methods offer a better chance for success to the casual user than
post-supervised, selection schemes. Our calculations also suggest that methods
which find the "best" number of prototypes "automatically" are not superior to
user specification of this parameter.

Keywords. Data condensation and editing, Nearest neighbor classifiers, Nearest
prototype classifiers, Post-supervision, Pre-supervision



1 Introduction 

The methods discussed begin with a crisply labeled set of training data
$X_{tr} = \{x_1, \ldots, x_n\} \subset \Re^p$ that contains at least one point
with class label i, $1 \le i \le c$. Let $x \in \Re^p$ be a vector
that we wish to label as belonging to one of the c classes. The standard nearest
prototype (1-np) classification rule assigns x to the class of the "most similar"
prototype in a set of labeled prototypes (or reference set), say
$V = \{v_1, \ldots, v_{n_p}\}$. $|V| = n_p$ may be less than, equal to, or
greater than c [1].

We use two performance measures to compare 1-np designs. $E_{np}(X_{tr}; V)$ is the
resubstitution (or training) error committed by the 1-np rule that uses V when applied
to the training data; $E_{np}(X_{test}; V)$ is the generalization (testing) error of
the same classifier when applied to a test set $X_{test} \subset \Re^p$.



’Research supported by ONR grant 00014-96-1-0642. 

F.J. Fern et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 1-16, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 
Good prototypes for 1-np classification have two desirable properties: minimal
cardinality (minimum $n_p$) and maximum classification accuracy (minimum
$E_{np}(X_{test}; V)$). However, these two goals naturally conflict. Increasing
$n_p$ up to some experimentally determined upper limit (usually with $n_p > c$)
almost always results in a decreasing trend in $E_{np}(X_{test}; V)$, and
conversely. One goal of our research is to study this conflict: how to find the
smallest set of prototypes that provides an acceptable generalization error.

There are four types of class labels (crisp, fuzzy, probabilistic and
possibilistic) and three sets of label vectors in $\Re^c$:

$N_{pc} = \{ y \in \Re^c : y_i \in [0,1] \; \forall i, \; y_i > 0 \; \exists i \} = [0,1]^c - \{0\}$ (possibilistic); (1)

$N_{fc} = \{ y \in N_{pc} : \sum_{i=1}^{c} y_i = 1 \}$ (fuzzy or probabilistic); (2)

$N_{hc} = \{ y \in N_{fc} : y_i \in \{0,1\} \; \forall i \}$ (crisp). (3)

For convenience we call all non-crisp labels soft labels. An example of soft labeling 
is diagnosing ischemic heart disease, where occlusion of the three main coronary 
arteries can be expressed by such a label, each entry being the degree of occlusion of a 
particular vessel. 

A useful framework for most of the methods we discuss is the generalized nearest
prototype classifier (GNPC). If x and the $v_i$'s are represented by feature
vectors in $\Re^p$, prototypical similarity is almost always based on some
function of pairwise distances between x and the $v_i$'s. Specifically, let
$x \in \Re^p$ be an input vector. The GNPC is defined by the 5-tuple [2, 3]:

1. A set of prototypes $V = \{v_1, \ldots, v_{n_p}\} \subset \Re^p$; (GNPC1)

2. A $c \times n_p$ prototype label matrix $L(V) = [l(v_1), \ldots, l(v_{n_p})] \in \Re^{c \times n_p}$; (GNPC2)

3. A similarity function $S(x, v_j) = \Theta(\|x - v_j\|)$ valued in [0,1]; (GNPC3)

4. A T-norm T to fuse $\{(l_i(v_j), S(x, v_j)) : 1 \le i \le c; 1 \le j \le n_p\}$; (GNPC4)

5. An aggregation operator A which, for class i, i = 1 to c, combines
$\{T(l_i(v_j), S(x, v_j)) : 1 \le j \le n_p\}$ as
$l_i(x) = A[\{T(l_i(v_j), S(x, v_j)) : 1 \le j \le n_p\}]$,
the i-th element of the overall soft label for x. (GNPC5)

Figure 1 shows some of the many groups of classifiers that belong to the GNPC 
family. Abbreviations in Figure 1: hard c-means (HCM), nearest neighbor (1-nn), 
learning vector quantization (LVQ) and radial basis function (RBF). We use other 
abbreviations, each of which is defined in the sequel. Many 1-np and other classifiers 
can be realized by different choices of the parameters in (GNPC1-GNPC5). When the 
prototypes have soft labels, each prototype may “vote” with varying assurance for all c 
classes. For example, if a prototype has the soft label [0.2, 0.4, 0.7], it is a
fairly typical example of class 3, but is also related (less strongly) to
classes 1 and 2.
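The five GNPC components can be illustrated with a minimal Python sketch. The similarity exp(-d^2), the minimum T-norm and the maximum aggregation are assumed choices made only for this illustration, and the function name `gnpc_label` is hypothetical; the text does not prescribe any of them.

```python
import math

def gnpc_label(x, prototypes, labels, tnorm=min, aggregate=max):
    """Soft label of x from a generalized nearest prototype classifier.

    prototypes: list of n_p points (GNPC1).
    labels: one length-c soft label per prototype, i.e. the columns of L(V) (GNPC2).
    The similarity exp(-squared distance) is an assumed choice valued in (0, 1] (GNPC3).
    """
    sims = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, v))) for v in prototypes]
    c = len(labels[0])
    # For each class i: fuse label entry and similarity with the T-norm (GNPC4),
    # then aggregate over all prototypes (GNPC5).
    return [aggregate(tnorm(lab[i], s) for lab, s in zip(labels, sims))
            for i in range(c)]

V = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
L = [[1, 0, 0], [0, 1, 0], [0.2, 0.4, 0.7]]   # third prototype carries a soft label
soft = gnpc_label((0.1, 1.8), V, L)
crisp = soft.index(max(soft)) + 1             # harden the soft label to a class number
```

Here the input lies close to the third prototype, so its soft label dominates the result and hardening gives class 3.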

Among the many characteristics of prototype extraction methods for 1-np classifier
design that can be discussed, we consider the following three to be most important:

(C1) Selection ($V_s \subseteq X$) versus replacement ($V_r \not\subset X$). When
talking about prototypes in general, we use the symbol V. When emphasis is needed,
we use subscripts ($V_s$ for selection of S-prototypes from X, $V_r$ for
replacement of X by R-prototypes). Replacement seeks $n_p$ points in $\Re^p$, so
the search space is infinite. Selection is limited to searching in
$X \subset \Re^p$, so solutions in this case can be sought by
combinatorial optimization. When the prototypes are selected points from the training
data, a 1-np classifier based on them is called a nearest neighbor (1-nn) rule. When all
of the training data are used as prototypes, it is the 1-nn rule; and when multiple votes
(say k of them) are allowed and aggregated, we have the well known k-nn rule
classifier.




Fig. 1. A few models that are generalized nearest prototype classifiers



(C2) Pre-supervised versus post-supervised designs [4]. Pre-supervised methods
use the data and the class labels to locate the prototypes. Post-supervised methods first
find prototypes without regard to the training data labels, and then assign a class label
to (relabel) each prototype. Selection methods are naturally pre-supervised, because
each prototype is a data point and already has its (presumably true) label.

(C3) User-defined $n_p$ versus algorithmically defined $n_p$. Most prototype
generators require advance specification of $n_p$ (e.g., classical clustering and
competitive learning methods). Some models have "adaptive" variants where an
initially specified $n_p$ can increase or decrease, i.e., prototypes are added or
deleted during training under the guidance of a mathematical criterion of
prototype "quality". A third group of methods do not specify $n_p$ at all, instead
obtaining it as an output at the termination of training. For example,
condensation methods which search for a minimal possible consistent set
belong to this category. Genetic algorithms and tabu search methods have a trade-off
parameter which pits the weight of a misclassification against an increase in the
cardinality of V by 1. Thus, methods based on these types of search deliver the
number of prototypes at termination of training. A method that finds or alters
$n_p$ during training will be called an auto-$n_p$ method; otherwise, the method
is user-$n_p$.

Table 1 lists 21 methods for prototype generation reviewed here. Eleven of the
methods are discussed in [5]; 16 of the methods are discussed in [6]; and six methods
are discussed in both [5, 6]. A pertinent reference for each method is given, along with
its classification by the criteria C1, C2 and C3: selection = (S), replacement = (R),
pre-supervised = [PRE], post-supervised = (post), auto-$n_p$ = (A) and
user-$n_p$ = (U). The notation (A)↓ means that the algorithm can only decrease $n_p$.

Table 1. Twenty one (among zillions of!) methods for finding prototypes

| Ref  | Acronym      | See    | C1: S or R | C2: Pre/Post | C3: $n_p$ |
|------|--------------|--------|------------|--------------|-----------|
| [5]  | W+H          | [5]    | (S)        | [PRE]        | (A)       |
| [9]  | Tabu         | [5]    | (S)        | [PRE]        | (A)       |
| [21] | LVQ1         | [5]    | (R)        | [PRE]        | (U)       |
| [22] | DSM          | [5]    | (R)        | [PRE]        | (U)       |
| [23] | LVQTC        | [5]    | (R)        | [PRE]        | (A)↓      |
| [3]  | GA           | [5, 6] | (S)        | [PRE]        | (A)       |
| [14] | RND = RS     | [5, 6] | (S)        | [PRE]        | (U)       |
| [20] | BTS(3) = BS  | [5, 6] | (R)        | [PRE]        | (U)       |
| [25] | VQ           | [5, 6] | (R)        | (post)       | (U)       |
| [26] | GLVQ-F       | [5, 6] | (R)        | (post)       | (U)       |
| [30] | HCM          | [5, 6] | (R)        | (post)       | (A)↓      |
| [18] | Chang        | [6]    | (R)        | [PRE]        | (A)↓      |
| [19] | MCA          | [6]    | (R)        | [PRE]        | (A)↓      |
| [10] | MCS          | [6]    | (S)        | [PRE]        | (A)       |
| na   | Sample Means | [6]    | (R)        | [PRE]        | (A)       |
| [27] | DR           | [6]    | (R)        | (post)       | (U)       |
| [31] | MFCM(3)      | [6]    | (R)        | (post)       | (U)       |
| [29] | FLVQ         | [6]    | (R)        | (post)       | (U)       |
| [28] | SCS          | [6]    | (R)        | (post)       | (U)       |
| [21] | SOM          | [6]    | (R)        | (post)       | (U)       |
| [1]  | FCM          | [6]    | (R)        | (post)       | (U)       |



Figure 2 depicts the 21 methods in Table 1 schematically. The Wilson/Hart (W+H)
method is Wilson’s method followed by Hart’s condensed nearest-neighbor (C-nn), so
it does not fit nicely into the tree in Figure 2. The three methods bracketed by < > are
not used in our numerical experiments, but are included here for completeness.




[Figure 2: tree rooted at "Prototype Generator $V = \{v_1, \ldots, v_{n_p}\}$".
 Selection ($V_s$, pre-supervised):
   Retain misclassified objects (condensing): <Hart - C-nn>, Dasarathy - MCS;
   Discard misclassified objects (editing): <Wilson>, <Multi-edit>;
   Strategy-free: Random search, Tabu search, GA search.
 Replacement ($V_r$):
   Pre-supervised: Bootstrap, Chang, DSM, LVQ1, LVQTC, MCA, Sample Means;
   Post-supervised: DR, FLVQ, HCM, GLVQ-F, MFCM(3), VQ, SCS, SOM.]

Fig. 2. Methods for finding prototypes



2 The 21 Methods 

We cannot give useful descriptions of the 21 models and algorithms shown in Table 1 
and Figure 2 here, so we briefly characterize the methods, and refer readers to [5, 6] 
and/or the original papers for more details. For convenience we drop the subscript of
$X_{tr}$, and refer to the training data simply as X unless otherwise indicated.

Selection by condensation. Condensation seeks a consistent reference set
$V_s \subseteq X$ such that $E_{np}(X_{tr}; V_s) = 0$. All condensation methods
enforce the zero resubstitution error requirement, so a trade-off between test
error rate and the cardinality of $V_s$ is impossible. The original condensation
method is Hart’s C-nn [7]. Many modifications of, and algorithms similar to, C-nn
are known [8]. The output of C-nn depends on the order of presentation of the
elements in X. Cerveron and Ferri [9] suggest running C-nn multiple times,
beginning with different permutations of X, and terminating the current run as
soon as it produces a set with the same cardinality as the smallest found so far.
This speeds the algorithm towards its destination and seems to produce good sets
of consistent prototypes. A minimal consistent set algorithm (MCS) for
condensation was proposed by Dasarathy [10]. Dasarathy’s MCS decides
which elements to retain after a pass through all of X, so unlike C-nn, MCS does not
depend on the order in which the elements of X are processed. MCS, however, does
not necessarily find $V_s$ with the true minimal cardinality [3].
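The order dependence of condensation is easiest to see in code. The following is a minimal sketch of a Hart-style C-nn pass structure, not the exact procedure of [7]: seeding the store with the first element, using squared Euclidean distance, and the names `hart_cnn` and `nearest_label` are assumptions made for this illustration.

```python
def nearest_label(x, protos):
    """Label of the prototype closest to x (squared Euclidean distance)."""
    return min(protos, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p[0])))[1]

def hart_cnn(data):
    """Condense data, a list of (point, label) pairs, into a consistent subset."""
    store = [data[0]]                 # assumed seed: first element in presentation order
    changed = True
    while changed:                    # repeat passes until nothing is added
        changed = False
        for item in data:
            # Add every point that the current store misclassifies.
            if item not in store and nearest_label(item[0], store) != item[1]:
                store.append(item)
                changed = True
    return store

X = [((0.0,), 0), ((0.2,), 0), ((0.4,), 0), ((1.0,), 1), ((1.2,), 1), ((1.4,), 1)]
Vs = hart_cnn(X)
errors = sum(nearest_label(x, Vs) != y for x, y in X)
```

On this toy set the store ends with one point per class and zero resubstitution error; presenting X in a different order can retain different points.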
Selection by editing. Error-editing assumes that points from different classes that
are close to decision boundaries should be discarded. Error-editing methods have no
explicit connection to either the resubstitution or generalization error rate performance
of the 1-np classifier based on the resultant $V_s$. This group of methods includes
Wilson’s method [11] and Multiedit [12]. Both schemes are based on deleting
misclassified objects. In Wilson’s method, the 3-nn algorithm is run once on X, and all
misclassified objects are deleted from X after the run, leaving prototype set $V_s$.
Multiedit is asymptotically Bayes optimal, but is not suitable for small data sets with
overlapping clusters, whereas Wilson’s method works well in these cases. We have
found that the methods of Hart, Wilson, and Randomized Hart are not very effective
in terms of either accuracy or data set reduction. The W+H (Wilson + Hart) method
introduced in [5] is just Wilson's method followed by Hart's C-nn.
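Wilson's editing step can be sketched in a few lines of Python. This sketch assumes that each point is classified by the 3-nn vote of the other points (leave-one-out) and that all deletions happen after the single run; the function name `wilson_edit` is hypothetical.

```python
from collections import Counter

def wilson_edit(data, k=3):
    """Keep the points whose label agrees with the k-nn vote of the other points."""
    kept = []
    for i, (x, y) in enumerate(data):
        # Leave-one-out (assumed): the point does not vote for itself.
        others = [d for j, d in enumerate(data) if j != i]
        others.sort(key=lambda d: sum((a - b) ** 2 for a, b in zip(x, d[0])))
        vote = Counter(lab for _, lab in others[:k]).most_common(1)[0][0]
        if vote == y:
            kept.append((x, y))
    return kept

X = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0), ((0.3,), 1),
     ((1.0,), 1), ((1.1,), 1), ((1.2,), 1), ((1.3,), 1)]
Vs = wilson_edit(X)
```

The class 1 point at 0.3 sits inside the class 0 cluster, so its three nearest neighbors outvote it and it is the only point removed.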

Editing techniques are often followed by condensation. Editing "cleans up" the
input data, yielding an initial set $V_{s,initial}$ that supposedly contains only
"easy" points. Then a condensation method reduces $V_{s,initial}$ to a possibly
smaller number of relevant final prototypes, say $V_{s,final}$. Ferri [13]
proposes a third step: Multiedit is used for phase 1 "clean up"; Hart's C-nn for
phase 2 condensation; $V_{s,final}$ is then used to reclassify all the original
points in X, and the newly labeled data set, say X', is used
with the decision surface method (DSM) to further refine the classification boundary.



Selection by search. A third group of methods for prototype selection attempt to
find the smallest possible $V_s$ with the highest possible 1-np accuracy through
criterion-driven combinatorial optimization. These methods are strategy free in the
sense that the decision to retain prototypes is based entirely on optimizing the criterion
function. The basic combinatorial optimization problem to be solved is:



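$$\max_{V_s \in P(X)} J(V_s) = \max_{V_s \in P(X)} \left\{ 1 - E_{np}(X; V_s) - \alpha \frac{|V_s|}{|X|} \right\} \qquad (4)$$

where P(X) is the power set of X and $\alpha$ is a positive constant which determines the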
trade-off between accuracy and cardinality [3, 9] . We use three methods from this 
third group that all make use of (4) to evaluate potential solutions to the selection 
problem: random selection, GA-based search, and Tabu search. 

For random selection (RS), the desired cardinality $n_p$ of $V_s$ and the number of
trials T are specified in advance. Then T random subsets of X of cardinality $n_p$ are
generated, and the one with the smallest error rate is retained as $V_s$. Skalak calls
this method a Monte Carlo simulation [14]. Random search works unexpectedly well for
moderate sized data sets [3], [14].
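Random selection is simple enough to write out directly. The sketch below follows the description above (T subsets of fixed cardinality, keep the one with the smallest resubstitution error); the toy data, the seed and the names `random_selection` and `error_rate` are assumptions for illustration only.

```python
import random

def error_rate(protos, data):
    """Fraction of data misclassified by the 1-np rule that uses protos."""
    wrong = 0
    for x, y in data:
        nearest = min(protos, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p[0])))
        wrong += nearest[1] != y
    return wrong / len(data)

def random_selection(data, n_p, trials, seed=0):
    """Keep the best of `trials` random subsets of cardinality n_p."""
    rng = random.Random(seed)
    best = rng.sample(data, n_p)
    for _ in range(trials - 1):
        cand = rng.sample(data, n_p)
        if error_rate(cand, data) < error_rate(best, data):
            best = cand
    return best

# Two well separated one-dimensional classes, five points each (toy data).
X = [((i / 10,), 0) for i in range(5)] + [((1 + i / 10,), 1) for i in range(5)]
Vs = random_selection(X, n_p=2, trials=50)
```

Any subset that contains one point from each class classifies this toy set without error, which is one intuition for why random search can do well when classes are compact.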

Editing training data with GAs has been discussed by Chang and Lippmann [15];
Kuncheva [16, 17]; and Kuncheva and Bezdek [3]. Our GA model is close to random
selection, and our computational experience is that a few runs of this simple scheme
can lead to a reasonably good solution. An even simpler evolutionary algorithm for
data editing called "random mutation hill climbing" was proposed by Skalak [14].
Instead of evolving a population of chromosomes simultaneously, only one
chromosome evolves (should we call it survival of the only?), and only mutation is
performed on it. The best set in T mutations is returned as $V_s$. The evolutionary
schemes in [3] and [14] are both heuristic. GA conducts a larger search by keeping
different subsets of candidates in its early stages. On the other hand, the random
mutation method is really simple, and, like the GA in [3], outperforms RS.
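A single-chromosome search of this kind might look as follows. This is a bit-flip sketch rather than Skalak's exact procedure: the criterion J (accuracy minus a penalty alpha |S| / |X|), the start from all of X, and the acceptance of mutations that do not lower J are all assumptions made for this example.

```python
import random

def accuracy(subset, data):
    """Fraction of data labeled correctly by the 1-np rule on subset."""
    if not subset:
        return 0.0
    hits = 0
    for x, y in data:
        nearest = min(subset, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p[0])))
        hits += nearest[1] == y
    return hits / len(data)

def rmhc(data, T, alpha=0.1, seed=0):
    """Random mutation hill climbing over subsets of data (bit-flip variant)."""
    rng = random.Random(seed)

    def J(mask):
        # Assumed criterion: reward accuracy, penalize cardinality.
        subset = [d for d, keep in zip(data, mask) if keep]
        return accuracy(subset, data) - alpha * len(subset) / len(data)

    best = [True] * len(data)          # assumed start: every point is a prototype
    for _ in range(T):
        cand = list(best)
        k = rng.randrange(len(data))
        cand[k] = not cand[k]          # mutate one bit of the single chromosome
        if J(cand) >= J(best):         # keep mutations that do not lower J
            best = cand
    return [d for d, keep in zip(data, best) if keep]

X = [((i / 10,), 0) for i in range(5)] + [((1 + i / 10,), 1) for i in range(5)]
Vs = rmhc(X, T=200)
```

Because J never decreases from its starting value, the returned subset keeps perfect accuracy on this toy set while shedding redundant points.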

Tabu search is an interesting alternative to randomized methods [9]. In this scheme
the number of iterations T is fixed but the cardinality $n_p$ is not. Similar to random
mutation hill climbing, TS operates on only the current solution S. A tabu vector of
length |X| is set up with all entries initially zero. An entry of 0 in the k-th place in the
tabu vector indicates that $x_k$ can be added or deleted from S, while an entry greater
than 0 prohibits a change in the status of $x_k$. A parameter $T_t$ called tabu tenure
specifies the number of iterations before a change of any previously altered bit is
allowed. An initial subset is picked as S, stored as the initial approximation of $V_s$,
and evaluated by $J(V_s)$. All neighboring subsets to S are evaluated by J. The neighbor
subset $\hat{S}$ that yields the highest value of J is called the winning neighbor. If
$J(\hat{S}) > J(V_s)$, $\hat{S}$ replaces S, regardless of the tabu vector, and $V_s$
and $J(V_s)$ are updated. If $J(\hat{S}) \le J(V_s)$, the tabu vector is checked. If
the move from S to $\hat{S}$ is allowed, it is made anyway, and the corresponding slot
of the tabu vector is set to $T_t$. Thus, tabu search does not necessarily have the
ascent property. All other non-zero values in the tabu vector are then decreased by one.
Different criteria can be applied for terminating the algorithm. Cerveron and Ferri’s
constructive initialization was used in [5], but we did not wait until a consistent set
was obtained. Instead, the initial incremental phase was terminated at a prespecified
number of iterations.

Pre-supervised replacement. The oldest method in this group (maybe 200 years
old) replaces crisp subset $X_i$ in X with its sample mean
$v_i = \sum_{x \in X_i} x / n_i$, where $n_i = |X_i|$, i = 1,...,c. Chang [18] gave
one of the earliest pre-supervised algorithms for extracting R-prototypes. Chang's
algorithm features sequential updating based on a
criterion that has a graph-theoretic flavor. Bezdek et al. [19] proposed a modification
of Chang's algorithm that they called the modified Chang algorithm (MCA).
Hamamoto et al. [20] gave a number of bootstrap methods for the generation of
R-prototypes. The Hamamoto method we used (called BTS(3)) requires choosing three
parameters: the number of nearest neighbors k, the desired number of prototypes
$n_p$, and the number of random trials T. A random sample of size $n_p$ is drawn
from X. Each data point is replaced by the mean of its k-nearest neighbors with the
same class label. The 1-np classifier is run on X using the new labeled prototypes.
The best set from T runs is returned as the final $V_r$. In our experience, Hamamoto
et al.'s method gives nice results. BTS(3) is a simple, fast, and unashamedly random
way to get pre-supervised R-prototypes that often yield low 1-np error rates.
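The BTS(3) recipe described above can be sketched as follows. Whether a sampled point counts among its own k nearest same-class neighbors is not stated in the text; this sketch assumes it does. The names `bts3`, `sqdist` and `error_rate` and the toy data are likewise illustrative assumptions.

```python
import random

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def error_rate(protos, data):
    """Fraction of data misclassified by the 1-np rule that uses protos."""
    wrong = 0
    for x, y in data:
        wrong += min(protos, key=lambda p: sqdist(x, p[0]))[1] != y
    return wrong / len(data)

def bts3(data, n_p, k, trials, seed=0):
    """Best of `trials` bootstrap prototype sets; returns (prototypes, error)."""
    rng = random.Random(seed)
    best, best_err = None, 2.0
    for _ in range(trials):
        protos = []
        for x, y in rng.sample(data, n_p):
            # k nearest points with the same label (the point itself included, by assumption).
            same = sorted((p for p, lab in data if lab == y), key=lambda p: sqdist(x, p))[:k]
            mean = tuple(sum(col) / len(same) for col in zip(*same))
            protos.append((mean, y))
        err = error_rate(protos, data)
        if err < best_err:
            best, best_err = protos, err
    return best, best_err

X = [((i / 10,), 0) for i in range(5)] + [((1 + i / 10,), 1) for i in range(5)]
Vr, err = bts3(X, n_p=2, k=3, trials=30)
```

The returned prototypes are local class means, so they generally do not coincide with any training point, which is what makes this a replacement rather than a selection method.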

Another basic design that can be used for prototype generation is the LVQ1
algorithm [21]. An initial set of $n_p \ge c$ labeled prototypes are randomly selected
from X as initial prototypes so that each class is represented by at least one
prototype. LVQ1 has three additional user-specified parameters: the learning rate
$\alpha_t \in (0,1)$, a constant $\eta \in (0,1)$ and the terminal number of
iterations T. The standard competitive
learning update equation is then used to alter the prototype set. Geva and Sitte’s DSM
[22] is a variant of LVQ1 which they assert better approximates classification
boundaries of the training data than LVQ1 does. These authors say the price for better
classification rates is that DSM is somewhat less stable than standard LVQs. In
LVQ1 the winning prototype is either punished or rewarded, depending on the
outcome of the 1-np label match to the input. In DSM, when the 1-np rule produces
the correct label, no update is made, but when misclassification occurs, the winner
(from the wrong class) is punished, and the nearest prototype from the same class as
the current input is identified and rewarded.
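The difference between the two update rules fits in a short sketch. It assumes the usual competitive learning step v ± rate (x − v) for reward and punishment, with a single fixed rate; the function names `lvq1_step` and `dsm_step` are hypothetical.

```python
def _winner(x, protos, only_label=None):
    """Index of the prototype nearest to x, optionally restricted to one class."""
    idx = [j for j, (_, lab) in enumerate(protos) if only_label is None or lab == only_label]
    if not idx:
        return None
    return min(idx, key=lambda j: sum((a - b) ** 2 for a, b in zip(x, protos[j][0])))

def _move(v, x, step):
    return tuple(vi + step * (xi - vi) for vi, xi in zip(v, x))

def lvq1_step(x, y, protos, rate):
    """LVQ1: reward the winner if its label matches y, otherwise punish it."""
    i = _winner(x, protos)
    v, lab = protos[i]
    protos[i] = (_move(v, x, rate if lab == y else -rate), lab)

def dsm_step(x, y, protos, rate):
    """DSM: update only on a misclassification."""
    i = _winner(x, protos)
    if protos[i][1] == y:
        return                                   # correct label: no update
    v, lab = protos[i]
    protos[i] = (_move(v, x, -rate), lab)        # punish the wrong-class winner
    j = _winner(x, protos, only_label=y)
    if j is not None:
        w, lab_j = protos[j]
        protos[j] = (_move(w, x, rate), lab_j)   # reward nearest same-class prototype

P1 = [((0.0,), 0), ((1.0,), 1)]
P2 = [((0.0,), 0), ((1.0,), 1)]
lvq1_step((0.4,), 1, P1, rate=0.5)
dsm_step((0.4,), 1, P2, rate=0.5)
```

For the misclassified input 0.4 with label 1, LVQ1 only pushes the class 0 winner away, while DSM additionally pulls the class 1 prototype toward the input.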

Both LVQ1 and DSM operate with a fixed number of prototypes chosen by the
user, so are user-$n_p$ methods. An auto-$n_p$ modification of LVQ that can prune and
relabel prototypes was proposed by Odorico [23], who called it LVQTC. In LVQTC
the winning prototype is updated depending on the distance to the input and the history
of the prototype. A prototype's historical importance is determined by the number of
times it has been the winner, and the learning rate used for this prototype decreases as 
its hit rate increases. The rationale for this treatment of the learning rate is that 
prototypes which have been modified many times have already found a good place in 
the feature space and should be less affected by subsequent inputs. This strategy is 
very similar to one of the earliest competitive learning models, viz., sequential hard c- 
means [24]. Odorico may or may not have recognized this, but in any case adds some 
novel heuristics to the original algorithm which seem both justifiable and useful. 

Post-supervised replacement. Methods in this category disregard class labels
during training, and use X without its labels to find a set $V_r$ of algorithmically
labeled prototypes. The prototypes are then relabeled using the training data labels.
To assign physical labels to the prototypes, the 1-np rule is applied to X using the
extracted prototypes. The number of winners for each prototype from each of the c
classes is counted. Finally, the most represented class label is assigned to each
prototype. This relabeling strategy guarantees the smallest number of
misclassifications of the resultant 1-np classifier on X, and is used in all of our
post-supervised designs.
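The relabeling step can be sketched as a majority vote over the training points that each prototype wins. Returning None for a prototype that wins no points is an assumption of this sketch, as is the function name `relabel`.

```python
from collections import Counter

def relabel(prototypes, data):
    """Give each prototype the majority class of the training points it wins."""
    votes = [Counter() for _ in prototypes]
    for x, y in data:
        j = min(range(len(prototypes)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, prototypes[j])))
        votes[j][y] += 1                      # x is won by its nearest prototype
    # A prototype that wins nothing gets no label (assumed convention).
    return [v.most_common(1)[0][0] if v else None for v in votes]

V = [(0.2,), (1.2,), (5.0,)]
X = [((0.0,), 0), ((0.1,), 0), ((0.3,), 1), ((1.0,), 1), ((1.3,), 1)]
labels = relabel(V, X)
```

The first prototype wins two class 0 points and one class 1 point, so it is labeled 0; any other label for it would misclassify more of the points it wins.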

Vector quantization (VQ) is one of the standard sequential models that has been
used for many years [25]. We adhered to the basic algorithm and applied it to each
data set in the post-supervised mode. VQ starts with the user randomly selecting an
initial set of $n_p$ unlabeled prototypes from X. The closest prototype is always
rewarded according to the update equation
$v_{i,new} = v_{i,old} + \alpha_t (x_k - v_{i,old})$. The learning rate $\alpha_t$
is indexed on t, the iteration counter (one iteration is one pass through X). We also
used the closely related self-organizing map (SOM), which reduces to unsupervised
VQ under circumstances laid out in [21]. Our runs with the SOM are discussed in
more detail in [6], and some results using the SOM are traced out on Figure 5.
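A sequential VQ pass can be sketched as below. The update assumed here is the standard winner-only step, new = old + rate (x − old), with one learning rate per pass through X; the initial prototypes are passed in rather than drawn at random, and the function name `vq` is hypothetical.

```python
def vq(points, prototypes, rates):
    """Unsupervised sequential VQ: move only the closest prototype toward each input."""
    protos = [list(v) for v in prototypes]
    for rate in rates:                        # one learning rate per pass through X
        for x in points:
            i = min(range(len(protos)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, protos[j])))
            protos[i] = [vi + rate * (xi - vi) for vi, xi in zip(protos[i], x)]
    return protos

X = [(0.0,), (1.0,)]
V = vq(X, prototypes=[(0.2,), (0.8,)], rates=[0.5])
```

After one pass with rate 0.5 each prototype has moved halfway toward the single input it won; class labels play no role until the prototypes are relabeled afterwards.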
Generalized Learning Vector Quantization - Fuzzy (GLVQ-F) is an unsupervised
sequential learning method for finding prototypes in which all c prototypes are
updated after each input is processed. The update formula for the special case of
weighting exponent m = 2 is [26]

$$v_{i,new} = v_{i,old} + \alpha_t u_i (x_k - v_{i,old}), \quad \text{with} \quad u_i = \left( \sum_{j=1}^{c} \frac{\|x_k - v_i\|^2}{\|x_k - v_j\|^2} \right)^{-1}, \quad 1 \le i \le c. \qquad (5)$$



The rest of the GLVQ-F algorithm is the same as VQ. Limit analysis in [26] shows 
that GLVQ-F reduces to VQ under certain conditions. Yet another sequential 
competitive learning model used in [6] is the deterministic "dog-rabbit" (DR) 
algorithm [27]. Like GLVQ-F, the DR algorithm may update all c prototypes for each 
input. Unlike GLVQ-F, the DR algorithm is not based on an optimization problem. 
Rather, its authors use intuitive arguments to establish the learning rate distribution 
that is used by the DR model. 

The soft competition scheme (SCS) is a probabilistic sequential learning model that 
bears much similarity to algorithms for the estimation of the components of certain 
mixtures of normal distributions. Updates in SCS are made to all c prototypes, instead 
of just the winner [28]. The fuzzy learning vector quantization (FLVQ) model shares 
many of the same characteristics as SCS, and these two models are compared in [29]. 

Three unsupervised batch learning models were also used in [6]. If we disregard the
labels of X, we can cluster it with any clustering algorithm that generates point
prototypes, relabel the prototypes, and take them as $V_r$. Good candidates include
the various c-means methods [1]. Our experiments in [5] used only classical hard
c-means [30], and we plot a point on Figure 5 that came from the modified fuzzy
c-means (MFCM-3) algorithm of Yen and Chang [31].



3 The Data Sets and Numerical Experiments in [5] 

This section summarizes the main points about 11 of the 21 methods (see Table 1),
which used the four data sets shown in Table 2; see [5] for better descriptions of the
methods, data and computational protocols.

Four figures in [5] plot $E_{np}(X_{test}; V)$ versus $|V| = n_p$ for the eleven
methods and four data sets in Table 2. The closer a point
$(n_p, E_{np}(X_{test}; V))$ is to the origin,
the better the 1-np classifier, because such a classifier will have a smaller number of
prototypes and also a smaller test error rate than classifiers that plot further from the
origin. For example, Figure 3 (Figure 5 in [5]) has coordinates for 10 of the 11
methods. The GA approach resulted in 77 prototypes for the cone-torus data, so we
decided not to plot this point, to keep the scale so that the other 10 methods could be
seen clearly. The same thing occurred with the other three data sets; the number of
prototypes chosen by GA was much larger than those found by the other 10 methods,
and this also occurred for the (W+H) classifier with two of the four data sets.
Table 2. Characteristics of the four data sets used in [5]

| Name           | p (# of features) | c (# of classes) | $|X_{tr}|$ | $|X_{test}|$ | Electronic Access |
|----------------|-------------------|------------------|------------|--------------|-------------------|
| Cone-Torus     | 2                 | 3                | 400        | 400          | http://www.bangor.ac.uk/~mas00a/Z.txt |
| Normal Mixture | 2                 | 2                | 250        | 1000         | http://www.stats.ox.ac.uk/~ripley/PRNN/ |
| Phoneme        | 5                 | 2                | 500        | 4904         | ftp.dice.ucl.ac.be, directory pub/neural-nets/ELENA/ |
| Satimage       | 36 (4 used)       | 6                | 500        | 5935         | ftp.dice.ucl.ac.be, directory pub/neural-nets/ELENA/ |


LVQTC and BTS(3) are the Pareto optimal [PO, 32] designs in Figure 3, i.e., the 
ones that are better than all methods in some dimension, and not dominated by any 
other method in other dimensions. The tradeoff between minimal representation and 
maximal classification accuracy is evident from the fact that none of the classifiers 
studied in [5] had smallest coordinates in both dimensions. 



[Figure 3: plot of $E_{np}(X_{test}; V)$ versus $n_p$; only the axis label survives.]
Figure 4 addresses the relevance of (C1)-(C3) to 1-np classifier design by
associating each of the 11 classifiers reviewed in [5] with one of the eight
possible coordinate triples in (C1, C2, C3) space. The horizontal axis is the
average ratio of $n_p$, the number of prototypes found or used, to the cardinality of
the training set $X_{tr}$,
$\bar{n}_p = \sum_{i=1}^{4} \left( |V_i| / |X_{tr,i}| \right) / 4$.
The vertical axis is the average training error,
$\bar{E}_{np} = \sum_{i=1}^{4} E_{np}(X_{tr,i}; V_i) / 4$.
The "best" designs are again closest to the origin,
and the four Pareto-optimal designs for averages over the four data sets are captured
by the shaded region in Figure 4. The coordinates of these four designs (LVQTC,
Tabu, LVQ1, BTS(3)) show ratios of: 3:1 for replacement vs. selection, 4:0 for
pre-supervised designs vs. post-supervised designs, and 2:2 for auto-$n_p$ vs.
user-$n_p$ selection of the number of prototypes. Thus, averaging over the four sets
of data changes only one ratio: the 3:1 best case ratio changes to 4:0 in the
comparison of pre- to post-supervised designs.



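[Figure 4: average error versus average ratio $n_p / |X_{tr}|$ for the 11
classifiers in [5], each point labeled with its (C1, C2, C3) triple; legible labels
include DSM, VQ, RND and GLVQ-F.]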
4 The Data and Numerical Experiments in [6]

Resubstitution errors $E_{np}(X_{tr}; V)$ computed with 1-np classifiers built from
runs using all 150 points in the Iris data [33] as $X_{tr}$ are used to compare the
subset of sixteen classifiers shown in Table 1 referenced as [6]. Using training error
as the standard of comparison enables us to compare different consistent classifiers.



Figure 5 is typical of the results in [6]. We see from Figure 5 that four classifiers
are consistent: MCA with 11 R-prototypes; GA with 12 S-prototypes; Chang with 14
R-prototypes; and MCS with 15 S-prototypes. There are two selection (GA, RS) and
two replacement (MCA, BS) designs among the four Pareto optimal designs in Figure
5. There are three Pareto optimal points from four designs in Figure 5: RS and GA (2
errors with 3 prototypes), BS (1 error with 5 prototypes), and MCA (no errors with
11 prototypes). We itemize the characteristics of the Pareto optimal designs in Table
3, along with their counterparts for the four data sets used in [5].
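The Pareto optimal points quoted above can be checked with a small dominance filter. The (prototypes, errors) pairs below are the values stated in this section for the Iris data; the dictionary keys (for example "GA-3" and "GA-12" for the two GA results) and the function name `pareto_optimal` are labels invented for this sketch.

```python
def pareto_optimal(designs):
    """Names of designs (n_p, errors) that no other design dominates."""
    def dominates(a, b):
        # a dominates b if it is no worse in both coordinates and differs from b.
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return sorted(name for name, point in designs.items()
                  if not any(dominates(other, point) for other in designs.values()))

iris = {
    "RS": (3, 2), "GA-3": (3, 2), "BS": (5, 1),
    "MCA": (11, 0), "GA-12": (12, 0), "Chang": (14, 0), "MCS": (15, 0),
}
front = pareto_optimal(iris)
```

The filter returns BS, GA-3, MCA and RS: four designs at three distinct points, since RS and GA-3 coincide, while the three larger consistent sets are dominated by MCA.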



5 Conclusions 

What can we conclude about the three characteristics of 1-np designs ? 

(C1) Selection (S) versus replacement (R);

(C2) Pre-supervised [PRE] versus post-supervised (post); and

(C3) User-$n_p$ (U) versus auto-$n_p$ (A).

Column 1 of Table 3 shows the winning 1-np classifier designs from four figures in
[5] and one figure in [6] for the five data sets used in these two studies. Each row has
a set of 3 check (✓) marks corresponding to the three characteristics. Since there are
four Pareto optimal designs for the Iris data, each pair of columns in Table 3 has a
total of 12 checks. The bottom row sums the checks in each column, and each pair
gives us a rough measure of the relative efficacy of 1-np designs for each of the pairs
of characteristics comprising C1, C2 and C3.

So, the ratio for selection vs. replacement is 1:2; for pre- vs. post-supervision is
5:1; and for user vs. auto n_p, 1:1. This indicates that, at least for these data sets
and trials, pre-supervised, replacement prototypes are the more desirable
combination of (C1) and (C2), while finding the best number of prototypes is done
equally well by user specification or "automatically". We conclude that:

1. Replacement prototypes seem to produce better 1-np designs than points
   selected from the training data.

2. Pre-supervision seems, overwhelmingly, to find more useful prototypes for
   1-np classifiers than post-supervision does.




Some Notes on Twenty One (21) Nearest Prototype Classifiers 







3. Methods that "automatically" determine the best number of prototypes and
   methods that are largely based on user specification and trial-and-error are
   equally likely to yield good 1-np classifiers.

4. There is a clear tradeoff between minimal n_p and minimal error rate, and this
   is a data dependent issue.

5. 1-np classifiers that use n_p > c, with c the indicated number of classes in the
   training data, will almost always produce lower error rates than 1-np
   designs that use one prototype per class.
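
The ratios quoted above can be re-derived by tallying the check marks of
Table 3. The sketch below is only a bookkeeping aid: the characteristics of each
row are transcribed from Table 3, except for the MCA row, whose check marks are
an assumption chosen so that the column totals agree with the stated ratios.

```python
from collections import Counter
from math import gcd

# (method, source, characteristics) transcribed from Table 3.
rows = [
    ("LVQTC", "[5]", "R PRE A"),
    ("BTS(3)", "[5]", "R PRE U"),
    ("LVQ1", "[5]", "R PRE U"),
    ("RS", "[5]", "S PRE U"),
    ("Tabu", "[5]", "S PRE A"),
    ("LVQTC", "[5]", "R PRE A"),
    ("HCM", "[5]", "R post A"),
    ("VQ", "[5]", "R post U"),
    ("RS", "[6]", "S PRE U"),
    ("GA", "[6]", "S PRE A"),
    ("BTS(3)", "[6]", "R PRE U"),
    ("MCA", "[6]", "R PRE A"),  # assumed, see the note above
]

# Count how many rows carry each characteristic.
counts = Counter(tag for _, _, tags in rows for tag in tags.split())


def ratio(a, b):
    """Reduce the pair a:b to lowest terms."""
    g = gcd(a, b)
    return f"{a // g}:{b // g}"


ratios = {
    "S:R": ratio(counts["S"], counts["R"]),
    "PRE:post": ratio(counts["PRE"], counts["post"]),
    "U:A": ratio(counts["U"], counts["A"]),
}
print(dict(counts))
print(ratios)
```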
Table 3. A summary of the Pareto optimal (PO) designs for the five data sets

Method   E_np(X_*;V)  PO in [5]  (S)  (R)  [PRE]  (post)  (U)  (A)
LVQTC    X_test       Fig. 5           ✓     ✓                  ✓
BTS(3)   X_test       Fig. 5           ✓     ✓             ✓
LVQ1     X_test       Fig. 6           ✓     ✓             ✓
RS       X_test       Fig. 6      ✓          ✓             ✓
Tabu     X_test       Fig. 7      ✓          ✓                  ✓
LVQTC    X_test       Fig. 7           ✓     ✓                  ✓
HCM      X_test       Fig. 8           ✓            ✓           ✓
VQ       X_test       Fig. 8           ✓            ✓      ✓

                      PO in [6]
RS       X_train      Fig. 6      ✓          ✓             ✓
GA       X_train      Fig. 6      ✓          ✓                  ✓
BTS(3)   X_train      Fig. 6           ✓     ✓             ✓
MCA      X_train      Fig. 6           ✓     ✓                  ✓

Total                             4    8    10      2      6    6

Abstract. This paper proposes a general framework for the development of a
novel approach to pattern recognition which is strongly based on graphical data
types. These data retain both the highly structured representation of classical
syntactic and structural approaches and the subsymbolic capabilities of
decision-theoretic approaches, typical of connectionist and statistical models.
As in decision-theoretic models, the recognition ability is mainly gained by
learning from examples, which, however, are strongly structured.



1 Introduction 

As Wiener pointed out early on, a pattern can often be regarded as an
arrangement characterized by the order of the elements of which it is made, rather
than by the intrinsic nature of these elements. Hence, the causal, hierarchical,
and topological relations between parts of a given pattern yield significant
information that seems to be useful in most human recognition processes. Among
others, these motivations have given rise to the impressive development of
approaches to syntactic and structural pattern recognition.

In the last three decades the emphasis of research in the area of pattern
recognition has swung pendulum-like between decision-theoretic and structured
approaches. Decision-theoretic methods are essentially based on numerical
features which provide a global representation of the pattern by means of an
appropriate pre-processing. Many different decision-theoretic methods have been
extensively tested in the framework of connectionist models, which operate
on sub-symbolic pattern representations. Conversely, syntactic and structural
pattern recognition and, additionally, artificial intelligence-based methods
have been developed which emphasize the symbolic nature of patterns. Since
their main focus is on expectations that can be derived from previous knowledge
of the components that have to be detected in the patterns under consideration,
such methods are often referred to as “knowledge-based” methods.

Such different approaches to pattern recognition have given rise to the
longstanding debate between traditional AI methods, based on symbols, and
computational intelligence methods, which operate on numbers. However, both
purely decision-theoretic and syntactical/structural approaches are limited when
applied to many interesting real-world problems, for opposite reasons.

It has recently been pointed out that traditional connectionist models
conceived for processing static data types can properly be extended so as to deal
with structured domains (see [12] for a survey on the topic). The basic idea is
that the input graphs are processed by attaching a state variable to each node
and performing a computation which is independent of the node. This assumption
of independence gives rise to computational models which are also referred
to as stationary models, and it represents at the same time the strength and the
weakness of these models.

In pattern recognition, the remarkable feature of these connectionist-based
models is that data can be graphs with real-valued nodes. As a result, the nodes
can contain typical real-valued features and the links among the nodes can
express the relationship among the pattern components.

In this paper we briefly review the connectionist approaches for structural
domains and show how they can profitably be applied to different pattern
recognition tasks. We discuss the extraction of appropriate graphical pattern
representations and present general ideas for the application to classification and
retrieval. We emphasize the potential advantages with respect to either traditional
adaptive pattern recognition or structural pattern recognition, but we also
point out the most severe limitations inherently related to the stationarity
assumptions in the propagation of the states attached to the input graphs. The
extension to non-stationary models, which would allow one to process input graphs
depending on the particular node while preserving the general connectionist-based
training scheme, is currently being investigated. These models are likely to
improve significantly the chance of dealing with very hard pattern recognition
problems.

This paper is organized as follows. In the next section we discuss the
limitations of decision-theoretic and structural approaches so as to motivate the
study of the proposed approach. In Section 3 we show a possible extraction
of data structure and briefly review the literature in the field. In Section 4 we
briefly review the basic ideas behind connectionist models for processing in
structured domains and in Section 5 we discuss the application to different pattern
recognition tasks. Finally, limitations and future investigations are discussed in
Section 6.

2 Decision-Theoretic vs Structural Approaches 

Syntactical and structural methods can nicely take the structure of patterns into
account. The classical structural approaches to pattern classification consider
each pattern class as a set of “strings” belonging to an appropriate grammar.
Classifying a pattern then means matching such a string against the production
rules of the grammar under consideration, to check if it is a valid string for
such a grammar. This approach, because of its inherent symbolic nature and its
inability to cope with corrupted instances of a class, is not flexible enough when
it comes to handling the noisy nature of patterns typically found in most
real-world applications.

This problem was recognized early and has been addressed in different ways in
order to incorporate statistical properties into structured approaches to pattern
recognition. The symbols used for either syntactical or structural approaches have
been enriched with attributes, which are in fact vectors of real numbers
representing appropriate features which are expected to allow for some statistical
variability in the patterns under consideration.

Error-correction mechanisms have been introduced to deal with either noise
or distortions [1]. Additionally, symbolic string parsing has been extended using
stochastic grammars to model the uncertainty and the randomness of the accepted
strings [16]. In these approaches, a probability measure is attached to the
productions of the grammar G and, finally, accepted strings can be assigned an
attribute representing the probability with which they belong to the class
represented by G. Lu and Fu [21] combined error-correction parsing and stochastic
grammars to attain a better integration with statistical approaches. A very
detailed survey on the incorporation of statistical approaches into syntactical and
structural pattern recognition can be found in [29]. Likewise, related approaches
to integrate AI-based and decision-theoretic methods can be found in [6].

On the other hand, either parametric or non-parametric statistical methods 
can nicely deal with distorted patterns and noise, but are severely limited in 
all cases in which the patterns are strongly structured. The feature extraction 
process in those cases seems to be inherently ill-posed; the features are either 
global or degenerate to the pixel level. 

The renewal of interest in artificial neural networks, which started in the
mid-eighties [25], suggested that pattern recognition methods shift from the
complex task of feature selection and extraction to the development of effective
architectures and learning algorithms. In principle, neural networks should be
capable of extracting optimal features for classification by themselves during the
learning process. Moreover, “learning from examples”, which is typical of
connectionist models, does not require assumptions on the data probability
distribution and is conceived to approximate virtually any non-linear
discrimination function.

The field of neural networks, however, has now reached the maturity which
is necessary to state that the previous belief is neither theoretically nor
experimentally founded. There is evidence to claim that complex pattern recognition
tasks, for instance highly structured ones, require architectures with many
parameters to make the loading of the weights effective. This choice, however,
makes generalization to new examples very hard [5]. On the other hand, the adoption
of architectures with few parameters, which would facilitate generalization,
results in a very hard loading of the weights. The nature of this problem is
partially addressed in the critical analyses of connectionist models by Fodor and
Pylyshyn [11] and Minsky [24].
3 Graphical Representations of Patterns



The graphical representation of patterns has been extensively studied, especially
in conjunction with structural approaches to pattern recognition. In the last few
years, most approaches have also incorporated grey levels and colors, thus
giving rise to an impressive number of techniques with different performance (see
e.g. [27, 20, 26, 10, 19, 22, 3, 4, 2, 15]). Amongst others, a possible way to create
structured representations is that of defining a relationship to generate the
children of a given node. Different objects may be related to one another to give a
scene its meaning. Such relationships can be of two kinds, the is-a relationship and
the part-of relationship, which correspond to two fundamental operations,
abstraction and the combination of parts into wholes, respectively.




Fig. 1. A graphical item and a possible graphical representation based on border 
relationships. 



As an example, Fig. 1 shows a possible way of creating a graphical pattern
representation which is based on the border relationship. The construction, which
is supposed to take place in a noise-free environment, is based on a supersource
that is chosen on the basis of the largest area of the building block. Consequently,
the children are defined by finding the blocks at the border and then acting
recursively.¹
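
The border-based construction can be sketched as follows. This is a minimal
illustration under stated assumptions, not the procedure used to produce Fig. 1:
the blocks, their areas and their adjacency are hypothetical inputs, each block
is attached to the first parent that reaches it (which yields a tree), and the
children are ordered by decreasing area, a choice the text does not specify.

```python
from collections import deque


def border_graph(areas, borders):
    """Build an ordered tree of blocks rooted at the largest block.

    areas:   {block: area}
    borders: {block: set of blocks sharing a border with it}
    """
    # Supersource: the building block with the largest area.
    root = max(areas, key=areas.get)
    children = {block: [] for block in areas}
    visited = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Ordering by decreasing area (ties by name) is an assumption.
        neighbours = sorted(borders.get(node, ()), key=lambda b: (-areas[b], b))
        for nb in neighbours:
            if nb not in visited:
                visited.add(nb)
                children[node].append(nb)
                queue.append(nb)
    return root, children


# Hypothetical item made of four blocks.
areas = {"a": 50, "b": 20, "c": 10, "d": 5}
borders = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
root, children = border_graph(areas, borders)
print(root, children)
```

A breadth-first visit is used here only to keep the example short; acting
recursively on each child, as the text describes, visits the same blocks.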

4 Neural Networks on Structured Domains 

Like data, the model can itself be structured in the sense that the generic variable
X_{i,v} might be independent of q_k^{-1} X_{j,v}, where, following the notation
introduced in [12, 28], q_k^{-1} is the operator which denotes the k-th child. The
structure of independence of some variables represents a form of prior knowledge.

¹ An artificial set of pictures can be created using attributed plex grammars. The
creation of pattern recognition benchmarks to assess the experimental results of the
methods described in this paper is currently under evaluation.
Basically, the knowledge of a recursive network yields topological constraints
which often make it possible to cut the number of trainable parameters
significantly. Let us consider a directed ordered graph such that for any node v one
can identify a possibly empty set of ordered children ch[v]. For each node v,

    X_v = f(X_{ch[v]}, U_v, v, \theta_x),
    Y_v = g(X_v, v, \theta_y).                                            (1)

From the encoding network depicted in Figure 2 we can see a pictorial
representation of the computation taking place in the recursive neural network. Each
nil pointer is associated with a frontier state, which is in fact an initial
state that turns out to be useful to terminate the recursive equation. The graph
plays its own role in the computation either because of the information attached
to its nodes or because of its topology. A formal description of the computation on
the input graph requires sorting the nodes, so as to define for which nodes the
state can be computed first. In the literature, this problem is referred to as
topological sorting. A sort of data flow computation takes place where the state of
a given node can only be computed once all the states of its children are known. To
some extent, the computation of the output Y_v can be regarded as a transduction
of the input graph U to an output Y with the same skeleton² as U. These
IO-isomorph transductions are the direct generalization of the classic concept
of transduction of lists. When processing graphs, the concept of IO-isomorph
transductions can considerably be extended to the case in which the skeleton
of the graph is also modified. Because of the kind of problems considered in this
paper, however, this case will not be treated here. The classification of
DOAGs is in fact the most important IO-isomorph transduction for applications
to pattern recognition. The output of the classification process corresponds to
Y_s, that is, the output value of the variables attached to the supersource in the
encoding network. Basically, when the focus is on classification, we disregard all
the outputs Y_v apart from the final value Y_s of the forward computation.

² The skeleton of a graph is the structure of the data regardless of the information
attached to the nodes.

Fig. 2. Compiling the encoding network from the recursive network and the given
data structure.

The information attached to the recursive network, however, needs to be
integrated with a specific choice of functions f and g which must be suitable for
learning the parameters. The connectionist assumption for functions f and g
turns out to be adequate especially to fulfill computational complexity
requirements.

Let o be the maximum outdegree of the given directed graph. The dependence
of node v on its children ch[v] can be expressed by pointer matrices
A_v(k) \in \mathbb{R}^{n,n}, k = 1, ..., o. Likewise, the information attached to the
nodes can be propagated by the weight matrix B_v \in \mathbb{R}^{n,m}. Hence, the
first-order connectionist assumption yields

    X_v = \sigma( \sum_{k=1}^{o} A_v(k) \cdot X_{ch_k[v]} + B_v \cdot U_v ),

where X_{ch_k[v]} is the state of the k-th child of v. As in list processing, the
output can be computed by means of Y_v = \sigma(C \cdot X_v).
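
The forward computation described above can be sketched in a few lines. This is
an illustrative sketch under several assumptions that are not taken from the
paper: a stationary model (the same matrices A(k), B and C are used at every
node), null frontier states, a logistic sigmoid as the squashing function, and
hypothetical names and toy weights. The nodes are visited children first, which
is the topological ordering mentioned earlier.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def topological_order(children):
    """Children-first ordering of the nodes of a directed ordered acyclic graph."""
    order, seen = [], set()

    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for c in children[v]:
            if c is not None:
                visit(c)
        order.append(v)

    for v in children:
        visit(v)
    return order


def forward(children, labels, A, B, C, n):
    """children[v] lists the o ordered children of v, with None for nil pointers."""
    frontier = [0.0] * n  # null frontier state (assumption)
    X, Y = {}, {}
    for v in topological_order(children):
        pre = matvec(B, labels[v])
        for k, c in enumerate(children[v]):
            child_state = frontier if c is None else X[c]
            pre = [p + q for p, q in zip(pre, matvec(A[k], child_state))]
        X[v] = [sigmoid(p) for p in pre]
        Y[v] = [sigmoid(p) for p in matvec(C, X[v])]
    return X, Y


# Toy graph with supersource "s"; node "b" is shared by "s" and "a".
children = {"s": ["a", "b"], "a": ["b", None], "b": [None, None]}
labels = {"s": [1.0], "a": [0.5], "b": [-1.0]}
A = [[[0.5, 0.0], [0.0, 0.5]], [[0.1, 0.2], [0.3, 0.4]]]  # one n x n matrix per child
B = [[1.0], [-1.0]]                                       # n x m
C = [[1.0, -1.0]]                                         # output layer
order = topological_order(children)
X, Y = forward(children, labels, A, B, C, n=2)
print(order, Y["s"])
```

For classification only Y["s"], the output at the supersource, would be used;
the outputs at the other nodes are computed here for completeness.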

The strong consequence of this graphical representation for f and g is that,
for any input graph, an encoding neural network can be created which is itself
a graph with neurons as nodes. Hence, the connectionist assumption makes it
possible to go one step beyond the general independence constraints expressed by
means of the concept of recursive network. The corresponding encoding network
turns out to be a graph whose links arise either because of the graph topology
or because of independence between variables or because of the connectionist
representations of the functions f and g themselves. The encoding networks
associated with the first-order state and output equations are depicted in Fig. 3 in
the particular case of stationary models, in which the parameters are independent
of the node v. Encoding neural networks turn out to be weighted graphs, that is,
there is always a real variable (weight) attached to the edges. Note that the
architectural choice expressed by the first-order state equation can be regarded as
a way to produce a multilayer-based map of the state which, however, transforms the
input U_v by means of one layer only. Obviously, one could also adopt a
multilayer-based architecture for implementing f(X_{ch[v]}, U_v, \theta_x). Likewise,
the function g(X_v, \theta_y) can be implemented by a multilayer perceptron. In
Figure 3 this function is created by means of one layer of sigmoidal neurons only.
Finally, in the framework of supervised learning, we can easily extend
backpropagation to adjust the shared parameters of the encoding neural networks. The
backpropagation takes place on neural nets which inherit the data structure and,
therefore, the corresponding learning algorithm is referred to as backpropagation
through structure.

5 Facing Pattern Recognition Problems 

The term adaptive graphical pattern recognition was first introduced in [9],
but early experiments using this approach were carried out in [14]. Graphs are
either in the data or in the computational model, since the adopted connectionist
models inherit the structure of the data graph and, moreover, have their
own graphical structure which expresses the dependencies on the single variables.
Basically, graphical pattern recognition methods integrate structure into
decision-theoretic models. The structure can be introduced at two different
levels. First, we can introduce a bias on the map (e.g. receptive fields). In so
doing the pattern of connectivity in the neural network is driven by the prior
knowledge in the application domain. Second, each pattern can be represented by
a corresponding graph. As put forward in the previous section, the hypothesis
of directed ordered graphs can be profitably exploited to generalize the forward
and backward computation of classical feedforward networks.

Fig. 3. The construction of a first-order recursive neural network from the encoding
network. The construction holds under the assumption that the frontier states are null.

The proposed approach can be pursued in most interesting pattern recognition
problems.

— classification 

Recursive neural networks seem to be very appropriate for either classification
or regression. Basically, the input structured representation is converted
to a static representation (the activation in the hidden layer), which is
subsequently encoded into the required class. This approach shares the advantages
and disadvantages of related MLP-based approaches for static data. In
particular, the approach is well-suited for complex discrimination problems,
but it is not very adequate for verification purposes. Methods like growing
and pruning can be successfully used for improving the learning process and
making the trial-and-error approach more systematic. It is worth mentioning
that recursive nets can profitably be used for classification of highly
structured inputs, like documents using XY-tree representations. Unfortunately,
in this particular kind of application the major limitation turns out to be
that the number of classes is fixed in advance. More appropriate models in
this case can be based on the preliminary ideas reported in [17].
— verification

Neural networks in structured domains can be used in verification problems,
where one wants to establish whether a given pattern belongs to a given
class against any possible input. Unlike pattern classification, one does not
know in advance the kind of inputs to be processed. It has been pointed out
that sigmoidal multi-layered neural networks are not appropriate for this
task [18] and, consequently, recursive neural networks, which represent
their natural extension to the case of structured domains, are not appropriate
for verification tasks either. However, as for multilayer networks, the adoption
of radial basis function units suffices to remove this limitation.

— retrieval 

The connectionist models introduced in this paper and the related extensions
are very good candidates to deal with many interesting image retrieval
tasks. In particular, the proposed models introduce a new notion of similarity
which is constructed on the basis of the user’s relevance feedback. In most
approaches proposed in the literature, the queries either involve global or
local features, but disregard the pattern structure. The proposed approach
makes it possible to retrieve patterns on the basis of a strong involvement
of the pattern structure, since the graph topology plays a crucial role in the
computation. On the other hand, since the nodes contain a vector of
real-valued features, the proposed approach can also potentially exhibit retrieval
capabilities which arise from the sub-symbolic nature of the patterns.



6 Discussion 

The renewal of interest in neural networks attracted impressive attention in the
pattern recognition community, where many people were stimulated by the
potential capabilities of these models to deal with any probability distribution.
Nowadays, there is theoretical and practical evidence to conclude that these
learning-based models cannot deal effectively with many interesting problems
in which patterns exhibit a significant structure. In this paper, we claim that
the new wave of connectionist models for processing in structured domains is
likely to offer many opportunities to face new complex pattern recognition tasks,
including the retrieval of images in visual data bases. Moreover, the basic ideas
behind the extension of connectionist models to structured domains also apply
to statistical models. There are, however, a number of limitations of the models
briefly reviewed in this paper.

— graph topology 

The proposed models operate on directed ordered graphs. This hypothesis
makes it possible to carry out a forward computation and extend feedforward
networks and backpropagation straightforwardly. When dealing with cyclic
graphs, however, at the moment the only computational schemes that have
been devised are based on relaxation methods.
— causality

The proposed models represent a natural extension of the processing of
sequences by causal dynamical systems. In pattern recognition, the hypothesis
of causality could be profitably removed, since there is no need to carry out
an on-line computation at node level.

— stationarity

The homogeneous computation which takes place at node level may not
be adequate in many pattern recognition problems. This has already been
pointed out in [2], where a simple solution has been adopted to face the
non-stationarity. The graphs are partitioned into different sets depending on the
number of nodes and are processed separately. A more general and promising
computational scheme has been devised in [13], where the non-stationarity
can be given a linguistic description.

The development of a systematic theory of connectionist and statistical models
to deal with structured domains is a promising new research field which has
already yielded interesting results. However, the actual application of this theory
to pattern recognition is still in its infancy and there are only a few preliminary
encouraging results. A massive comparative application to classical pattern
recognition problems is the only way to assess the effectiveness of the
proposed approach.

Acknowledgements 

This paper is the result of discussions with many people who contributed in
different ways to the theory of adaptive computation on structured domains and
its applications. In particular we would like to thank Andreas Küchler, Paolo
Frasconi, Alessandro Sperduti, and Marco Maggini, who contributed to shaping
most of the ideas for the application to pattern recognition presented herein.

References 

1. A. Aho and T.G. Peterson, “A minimum distance error-correcting parser for
context-free languages,” SIAM Journal on Computing, no. 4, pp. 305-312, 1972.

2. H. Asada and M. Brady, “The curvature primal sketch,” in Proc. Workshop on
Computer Vision (M. Caudill and G. Butler, eds.), (Annapolis, MD), pp. 609-618,
1984.

3. D. Ballard, “Strip trees: a hierarchical representation for map features,” in Proc.
of the 1979 IEEE Computer Society Conference on Pattern Recognition and Image
Processing, (New York, NY), pp. 278-285, IEEE, 1979.

4. D. Ballard, “Strip trees: a hierarchical representation for curves,” Communications
of the ACM, vol. 24, pp. 310-321, 1981.

5. E. Baum and D. Haussler, “What size net gives valid generalization?,” Neural
Computation, vol. 1, no. 1, pp. 151-160, 1989.

6. H. Bunke, “Hybrid pattern recognition methods,” in Syntactic and Structural Pattern
Recognition: Theory and Applications (H. Bunke and A. Sanfeliu, eds.), ch. 11,
pp. 307-347, World Scientific, 1990.






G. Adorni, S. Cagnoni, and M. Gori 



7. Clarke, F., Ekeland, I.: Nonlinear oscillations and boundary-value problems for
Hamiltonian systems. Arch. Rat. Mech. Anal. 78 (1982) 315-333

8. Clarke, F., Ekeland, I.: Solutions périodiques, du période donnée, des équations
hamiltoniennes. Note CRAS Paris 287 (1978) 1013-1015

9. M. Diligenti, M. Gori, M. Maggini, and E. Martinelli, “Graphical pattern
recognition,” in Proceedings of ICAPR98, 1998.

10. C. Dyer, A. Rosenfeld, and H. Samet, “Region representation: Boundary codes
from quadtrees,” Communications of the ACM, vol. 23, pp. 171-179, 1980.

11. J. Fodor and Z. Pylyshyn, “Connectionism and cognitive architecture: A critical
analysis,” Connections and Symbols, pp. 3-72, 1989. A Cognition Special Issue.

12. P. Frasconi, M. Gori, and A. Sperduti, “A general framework for adaptive
processing of data structures,” IEEE Transactions on Neural Networks, vol. 9,
pp. 768-786, September 1998.

13. P. Frasconi, M. Gori, and A. Sperduti, “Integration of graphical-based rules with
adaptive learning of structured information,” in Hybrid Neural Systems (R. Sun and
S. Wermter, eds.), Springer Verlag, March 2000.

14. P. Frasconi, M. Gori, S. Marinai, J. Sheng, G. Soda, and A. Sperduti, “Logo
recognition by recursive neural networks,” in Proceedings of GREC97, pp. 144-151,
1998.

15. H. Freeman, “On the encoding of arbitrary geometric configurations,” IRE Trans.,
vol. EC-10, pp. 260-268, 1961.

16. K. Fu, Syntactic pattern recognition. Englewood Cliffs, NJ: Prentice-Hall, 1982.

17. C. Goller and M. Gori, “Feature extraction and learning vector quantization for
data structure,” in Proceedings of SOCO’99, (Genoa (Italy)), June 1999.

18. M. Gori and F. Scarselli, “Are multilayer perceptrons adequate for pattern
recognition and verification?,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 20, no. 10, pp. 1121-1132, 1998.

19. G. Hunter, Efficient Computation and Data Structures for Graphics. PhD thesis,
Dept. of Electrical Engineering and Computer Science, Princeton University,
Princeton, NJ, 1978.

20. G. Hunter and K. Steiglitz, “Operations on images using quadtrees,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 1, pp. 145-153, 1979.

21. S. Lu and K. Fu, “Stochastic error-correction syntax analysis for recognition of
noisy patterns,” IEEE Transactions on Computers, no. 26, pp. 1268-1276, 1977.

22. D. Meagher, “Geometric modelling using octree encoding,” Computer Graphics
and Image Processing, pp. 129-147, 1982.

23. Michalek, R., Tarantello, G.: Subharmonic solutions with prescribed minimal
period for nonautonomous Hamiltonian systems. J. Diff. Eq. 72 (1988) 28-55

24. M. Minsky and S. Papert, Perceptrons - Expanded Edition. Cambridge: MIT Press,
1988.

25. D. Rumelhart, J. McClelland, and the PDP Research Group, Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, vol. 1. Cambridge: MIT
Press, 1986.

26. P. Salembier and L. Garrido, “Binary partition tree as an efficient representation
for filtering, segmentation and information retrieval,” in IEEE Int. Conference on
Image Processing, ICIP’98, vol. 2, (Los Alamitos, CA), pp. 252-256, IEEE Comp.
Soc. Press, 1998.

27. H. Samet, “Spatial data structures,” in Modern Database Systems: The Object
Model, Interoperability, and Beyond (W. Kim, ed.), pp. 361-385, Reading, MA: Addison
Wesley/ACM Press, 1995.




Adaptive Graphical Pattern Recognition 






28. A. Sperduti and A. Starita, “Supervised neural networks for the classification of
structures,” IEEE Trans. on Neural Networks, vol. 8, no. 3, pp. 714-735, 1997.

29. W.-H. Tsai, “Combining statistical and structural methods,” in Syntactic and
Structural Pattern Recognition: Theory and Applications (H. Bunke and A. Sanfeliu,
eds.), ch. 12, pp. 349-366, World Scientific, 1990.




Current Trends in Grammatical Inference 



Colin De La Higuera
EURISE, Université de Saint-Etienne

France

cdlh@univ-st-etienne.fr



Abstract. Grammatical inference has historically found its first theoretical
results in the field of inductive inference, but its first applications in that of
Syntactic and Structural Pattern Recognition. In the mid nineties, the field
emancipated and researchers from a variety of communities moved in:
Computational Linguistics, Natural Language Processing, Algorithmics, Speech
Recognition, Bio-Informatics, Computational Learning Theory, Machine
Learning. We claim that this interaction has been fruitful and has allowed, in a
few years, the appearance of formal theoretical results establishing the quality,
or lack thereof, of Grammatical Inference techniques, and probably more
importantly the discovery of new algorithms that can infer a variety of types of
grammars and automata from heterogeneous data.



1. Grammatical Inference: The Last Few Years 

Whilst it is generally accepted that the first theoretical foundations in grammatical
inference were given by M.E. Gold [3], the first usable grammatical inference
algorithms were proposed inside the Syntactic Pattern Recognition community [2].
The algorithms were typically based on some syntactic property from formal language
theory: iteration lemma, Nerode’s equivalence, etc. Typical applications included
classification and analysis of patterns, biological sequence classification, character
recognition, etc. The main problem was that of dealing with positive data only (no
counter-examples). The inability to cope with noisy data, or data that does not fit
exactly into a finite state machine, led to defining classes of languages that were too
complex, and thus, the good formal language theories were lost.

The question of learning automata and grammars from strings was nevertheless 
sufficiently general to spread out of this community and to start being dealt 
with by researchers from other fields. D. Angluin, through several papers [1], used language 
learning as a central problem for Computational Learning Theory: the problem led to 
defining and studying different learning paradigms, such as learning with the help of 
oracles. 

In 1989 L. Pitt [10] proved the difficulty of learning Deterministic Finite Automata 
(DFA) under a variety of learning paradigms; these negative results were strengthened 
in the following years by other researchers, proving that within the common PAC-learning 
framework, inferring DFA was simply too hard. 

In 1993 the first, rather informal, ICGI* meeting took place in Britain. It was 
followed by ICGIs in Spain (1994), France (1996), USA (1998) and Portugal (2000) 
[9, 6, 5]. These meetings provided the researchers in the area with the occasion to 
meet and discuss new algorithms, paradigms and problems. Significant results have 
thus been obtained in the last few years [4]. 

Another factor that has enabled recent progress has been the availability of 
benchmarks and large scale competitions [7, 8]. Applications and data from a 
variety of other fields (Bio-Informatics, Speech Recognition, Time Series) have also 
put emphasis on developing new methods [11]. 



2. Some Recent Grammatical Inference Results 

Grammatical inference, otherwise referred to as grammar induction or language 
learning, is now a specific research domain with its community, its conference, and 
some well established techniques. 

The advantages of using grammatical inference may be seen as the following: 

• The objects grammatical inference deals with are formal grammars or automata, 
with a well-studied theory that can be used to design new algorithms or to prove 
the convergence of these algorithms. 

• Grammatical inference is one of the few (reasonably) unbiased methods that allow 
recursive concepts to be learned. Other techniques that could claim to do the same 
would be: 

- Inductive Logic Programming: yet in existing systems, the user has to declare 
which predicates are to be called recursively, and even then the 
recursion has to be fairly simple. 

- Neural Nets: a convincing comparison between these and formal grammar learning 
is still needed, but unlike automata learning the architecture of the neural network 
has to be given in advance (in the cases of the use of some long term memory, the 
length of this memory must also be given). 

- Hidden Markov Models are models closely related to stochastic finite automata. 
Again, the number of states of the desired HMM is a usual parameter of the 
learning/training algorithm. 

During the last few years one can point out the following new research directions and 
results^: 

• Using the Kolmogorov Complexity framework: after the failure to obtain positive 
results in the PAC framework, research under the simple PAC framework has been 
pursued: the idea is that the learning examples are drawn according to a simple 
distribution, i.e. one where the simplest examples have higher probability than the 
most complex ones. After proving that well known algorithms allow for the DFA 
class to be simple PAC learnable, the framework has permitted the introduction of 
learning algorithms for new classes such as the linear grammars. 

• Adapting string learning algorithms to tree grammar learning. 

• Finding new classes for which polynomial learning is possible: these can be either 
super classes of the regular languages when learning with counterexamples or 
subclasses when only positive examples are given. 

• Building incremental Grammatical Inference algorithms. 

• Using Genetic Programming, Taboo search, or Constraint Satisfaction techniques 
to find a solution. Since the combinatorics of the problem show that a greedy 
approach can only be successful in some cases, Artificial Intelligence heuristics 
can be of help. 

* International Colloquium on Grammatical Inference 

^ These can be found in the ICGI proceedings, or in the Machine Learning Journal special issue 
on Grammatical Inference, to appear, or on the Grammatical Inference homepage [4]. 



3. Open Problems and Key Issues 

Although in the last few (6 or 7) years many novel results have been obtained in the 
Grammatical Inference community, some crucial problems have either not been dealt 
with, or only in a very unsatisfactory way. Among these: 

• The need for algorithms that can deal with noisy data: the usual benchmarks the 
community used in the late 90s were concerned with learning large (500 
states) automata from positive and negative data. But in all cases this data has to be 
noise free: if one introduces even one incorrectly labeled string into the learning set 
for the top algorithms today, there is no chance of obtaining a correct solution, and 
it is plausible that the returned solution (if any) would be too large to be used. 
Several techniques have been used to deal with this problem: 

- Learning stochastic finite automata: the assumption here is not that the language is 
regular but that the distribution is. Several algorithms have been proposed since [9]. 

- Learning non-deterministic finite automata. The problem is difficult and little has 
been done [11]. 

- Reducing the problem to a graph coloring problem/constraint satisfaction problem. 
The idea is to use NP-problem solvers on the combinatorial task involved (one is 
really trying to find the smallest consistent grammar). 

- Using top down algorithms: nearly all the algorithms generalize a most specific 
grammar by state merging. It is well known that for a technique to be noise 
tolerant, working your way from a most general concept to a more specialized one 
is a better idea. 

• The need to build algorithms that learn context-free grammars. It can be proved 
that DFA are more or less the largest class that can be efficiently learned by 
provably converging algorithms [6]. Yet of course, at least for practical reasons, the 
class is insufficient. A lot of work has been done, with many different ideas 
(genetic algorithms, taboo search, maximum likelihood approaches), on learning 
context-free grammars. Obviously the negative theoretical results imply that no 
formal proof of the validity of these ideas can be given. Nevertheless it is clear that 
learning context-free grammars is a real challenge for the community. One of the 
first steps should be the acceptance of some common benchmark or task. 

• Learn grammars and something else... Formal grammars have a nice 
generalization capacity, but only for certain sorts of objects (in theory they can 
formalize a lot of things, but the size of a DFA representing a decision list, or some 
logical formula, might be extravagant). Association of Grammatical Inference 
techniques with other Machine Learning/Syntactic Pattern Recognition procedures 
would be of mutual benefit. One could for instance: 

- Use grammatical inference algorithms in Inductive Logic Programming tasks 

- Combine symbolic grammatical inference techniques with neural nets 

- Combine Grammar Learning with other Machine Learning techniques, like 
learning decision trees and lists [6]. 
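
Several items above refer to state merging, where a most specific automaton is first built from the sample and then generalized. A minimal sketch of that starting point, the prefix tree acceptor, is given below. The function names (`build_pta`, `accepts`) and the toy sample are illustrative assumptions and are not taken from any of the algorithms cited here; the merging step itself is omitted.

```python
def build_pta(positive):
    """Build a prefix tree acceptor: one state per prefix of the positive sample."""
    transitions = {}      # (state, symbol) -> state
    accepting = set()
    next_state = 1        # state 0 is the root (empty prefix)
    for word in positive:
        state = 0
        for symbol in word:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        accepting.add(state)
    return transitions, accepting


def accepts(automaton, word):
    """Return True if the deterministic automaton accepts the word."""
    transitions, accepting = automaton
    state = 0
    for symbol in word:
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    return state in accepting


# Hypothetical positive sample over the alphabet {a, b}.
positive = ["ab", "abab", "b"]
pta = build_pta(positive)
```

The acceptor recognizes exactly the sample and nothing else, which is why a generalization step (merging states while remaining consistent with any negative data) is needed afterwards.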



Abstract. We consider an integrated approach to designing the classification rule. 
Here the qualities of the statistical and neural net approaches are merged together. 
Instead of using the multivariate models and statistical methods directly to 
design the classifier, we use them in order to whiten the data and then to train 
the perceptron. Special attention is paid to the magnitudes of the weights and to 
optimization of the training procedure. We study the influence on the result of all 
characteristics of the cost function (target values, conventional regularization 
parameters) and of the parameters of the optimization method (learning step, starting 
weights, noise injection to the original training vectors, to the targets, and to the 
weights). Some of the discussed methods of complexity control have 
hardly been discussed in the literature yet. 



1 Introduction 



For a number of years a scientific discussion about the preference for parametric or 
nonparametric classification rules has been going on. Some scientists argue that it is not 
wise to make assumptions about the type and parameters of the multivariate 
distribution densities, to estimate these characteristics from the training set data and 
only then to construct the classifier (see e.g., [15]). Instead of constructing 
parametric statistical classifiers, they advocate that a much better way is to make 
assumptions about the structure of the classification rule (for example a linear 
discriminant function in a space of original or transformed features) and then to 
estimate the unknown coefficients (weights) of the discriminant functions directly from 
the training data. This is a typical formulation of the classifier design problem utilized 
in artificial neural network (ANN) theory. Another point of view is characteristic of 
statistical pattern recognition. This approach holds that making assumptions 
about the type of the distribution density function is a way of introducing prior 
information into the classifier’s design process. When this additional 
information is correct, it can reduce the classification error [4]. Therefore, up to now 
the discussion between followers of both approaches has continued. 

An important recent result obtained within ANN theory is a 
demonstration that in training, the non-linear single layer perceptron (SLP) evolves: one can obtain several 
standard statistical classification and prediction rules of different complexity [9]. It 
was shown that conditions E exist where, after the first iteration of the gradient 
minimization (Back Propagation - BP) training algorithm performed in batch mode, 
one can obtain the well known Euclidean distance classifier (EDC). For this one needs 
to start training from a zero initial weight vector, and the mean of the training data should be 
moved to the center of co-ordinates. In further iterations, one can obtain the classical 
regularized discriminant analysis (RDA), and subsequently one moves towards the 
standard linear Fisher classifier. If the number of dimensions exceeds the training set 
size, then we approach the Fisher classifier with the pseudo-inversion of the 
sample covariance matrix. In further iterations, we have a kind of robust classifier 
which is insensitive to outliers, i.e. atypical training set vectors distant from the 
discriminant hyper-plane. If the weights of the perceptron are large, we move towards 
a minimum empirical error classifier. If we have no empirical errors, we 
approach the maximal margin (support vector) classifier. We can also train the 
perceptron in a space of new features obtained by nonlinear 
transformations of the original n features. Then the number of types of 
classification rules can be increased. 

The evolution of the non-linear SLP can be utilized to integrate the statistical and 
neural net theory based approaches to designing classification and prediction rules [10]. 
In this approach, instead of designing parametric statistical classifiers we use the 
training set based information (sample means, conventional, constrained, or 
regularized estimates of the covariance matrix) in order to whiten the distribution of 
the vectors to be classified. Thus, we transform the original data into spherical data and 
have a good initialization: after the whitening transformation and the first BP iteration, 
we obtain EDC, which for spherical Gaussian data is the best sample based 
classifier. In the original feature space, this EDC is equivalent to the statistical classifier 
which could be obtained by utilizing “the training set based information” directly. It 
can happen that the parametric assumptions utilized to transform the data are not 
entirely correct. Then in further perceptron training, we have the possibility to 
improve the decision boundary. If we train and stop correctly, we can obtain an 
“optimal” solution. 

In order to obtain the full spectrum of statistical classifiers we need to control the 
training process purposefully. The complexity of the classifier is a very important factor 
that influences performance. In the small training sample case, it is worth using 
simple classifiers, and in the large sample case, it is preferable to use complex ones 
[6, 8]. Thus, the problem arises of how to choose a classifier of optimal complexity. 



2 The Perceptron and Its Training Rule 

In this paper we restrict ourselves to an analysis of the single layer and one hidden 
layer perceptron (SLP and MLP) classifiers used to obtain the classification rule in the 
two category case. We consider a simple back propagation training technique based on 
a gradient descent minimization procedure. We discuss the influence of all parameters 
of the cost function, as well as of the optimization procedure, on the complexity of the 
classification rule obtained. SLP has a number of inputs $x_1, x_2, \ldots, x_n$ and an output 
$o$ which is calculated according to the equation 

$$o = f(V^T X + w_0), \tag{1}$$

where $f(net)$ is a non-linear activation function, e.g. a tanh function 

$$f(net) = \tanh(net) = \frac{e^{net} - e^{-net}}{e^{net} + e^{-net}}, \tag{2}$$

$w_0$, $V^T = (v_1, v_2, \ldots, v_n)$ are weights of the discriminant function (DF), to be 
learned during training, and $T$ denotes the transpose of vector $V$. 

To find the weights we have to minimize a certain cost function. Most popular is the 
sum of squares cost function 

$$cost_t = \frac{1}{N_1 + N_2} \sum_{i=1}^{2} \sum_{j=1}^{N_i} \left( t_j^{(i)} - f(V^T X_j^{(i)} + w_0) \right)^2 + \lambda V^T V, \tag{3}$$

where $t_j^{(i)}$ is a desired output (a target) for $X_j^{(i)}$, the j-th training set observation 
from class $\omega_i$, and $N_i$ is the number of training vectors from class $\omega_i$. 
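
The cost function (3) can be sketched in code as follows. This is a minimal sketch, assuming the reading cost = 1/(N1 + N2) * sum of (t - f(V^T X + w0))^2 plus lambda * V^T V, a tanh activation as in (2), and one common target per class; the function names and the toy data are illustrative assumptions.

```python
import math


def slp_output(V, w0, x):
    """Output o = tanh(V^T x + w0) of the single layer perceptron, equation (1)."""
    net = sum(v * xi for v, xi in zip(V, x)) + w0
    return math.tanh(net)


def cost(V, w0, classes, targets, lam=0.0):
    """Sum of squares cost (3) with the weight decay term lam * V^T V.

    classes: list of two lists of feature vectors (class 1 and class 2)
    targets: pair (t1, t2) of desired outputs, one per class (an assumption)
    """
    n_total = sum(len(c) for c in classes)
    squared = 0.0
    for vectors, t in zip(classes, targets):
        for x in vectors:
            squared += (t - slp_output(V, w0, x)) ** 2
    return squared / n_total + lam * sum(v * v for v in V)


# Hypothetical toy data: two vectors per class in two dimensions.
classes = [[[-1.0, 0.0], [-2.0, 1.0]], [[1.0, 0.0], [2.0, -1.0]]]
zero_cost = cost([0.0, 0.0], 0.0, classes, (-1.0, 1.0))
```

With zero weights every output is zero, so for targets -1 and 1 each squared residual equals one and the averaged cost equals one as well.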

The term $\lambda V^T V$ is called a “weight decay” term, where $\lambda$ is a positive constant 
called the regularization parameter. If activation function (2) is used then typically 
one uses either $t_j^{(1)} = -1$, $t_j^{(2)} = 1$ or $t_j^{(1)} = -0.8$, $t_j^{(2)} = 0.8$. Outputs of the activation function 
(2) vary in the interval (-1, 1). Therefore, we will refer to the target values -1 and 1 as 
limiting ones, and to the values -0.8 and 0.8 as “close”. The one hidden layer MLP 
considered in the present paper consists of one output neuron with the transfer 
function (1) and a number of hidden neurons with the same sort of activation 
function. In the gradient descent optimization, we find the weights in an iterative 
way. At step t we update the weight vector according to the equation 

$$V_{(t+1)} = V_{(t)} - \eta \frac{\partial cost_t}{\partial V}, \qquad w_{0(t+1)} = w_{0(t)} - \eta \frac{\partial cost_t}{\partial w_0}, \tag{4}$$

where $\eta$ is a learning-step, and $\partial cost_t / \partial V$ is a gradient of the cost function. 
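
The update rule (4) can be sketched as one batch gradient step. The sketch below assumes the tanh activation and the sum of squares cost with weight decay; the names `gradient` and `step` and the toy data are illustrative assumptions. It also checks the statement from the Introduction on a small case: starting from zero weights, with centred data, equal class sizes and targets -1 and 1, the first step gives a weight vector proportional to the difference of the class means, which is the direction used by EDC.

```python
import math


def gradient(V, w0, data, lam=0.0):
    """Gradient of the cost for tanh activation; data is a list of (x, t) pairs."""
    n = len(data)
    gV = [2.0 * lam * v for v in V]       # derivative of lam * V^T V
    gw0 = 0.0
    for x, t in data:
        o = math.tanh(sum(v * xi for v, xi in zip(V, x)) + w0)
        # d/dnet of (t - tanh(net))^2 equals -2 (t - o) (1 - o^2)
        common = -2.0 * (t - o) * (1.0 - o * o) / n
        for k, xi in enumerate(x):
            gV[k] += common * xi
        gw0 += common
    return gV, gw0


def step(V, w0, data, eta, lam=0.0):
    """One batch update: V <- V - eta * dcost/dV, and the same for w0."""
    gV, gw0 = gradient(V, w0, data, lam)
    return [v - eta * g for v, g in zip(V, gV)], w0 - eta * gw0


# Two classes of equal size, targets -1 and +1, overall data mean at the origin.
class1 = [[-3.0, -1.0], [-1.0, -1.0]]
class2 = [[1.0, 1.0], [3.0, 1.0]]
data = [(x, -1.0) for x in class1] + [(x, 1.0) for x in class2]

V1, w1 = step([0.0, 0.0], 0.0, data, eta=0.5)
mean1 = [sum(col) / len(class1) for col in zip(*class1)]
mean2 = [sum(col) / len(class2) for col in zip(*class2)]
mean_diff = [b - a for a, b in zip(mean1, mean2)]
```

In this toy case the first step yields V1 equal to eta times the mean difference, while the bias w1 stays at zero because the two classes contribute equal and opposite terms.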

There exist a number of techniques used to control the complexity of ANN. 
Descriptions of these techniques are dispersed over a number of scientific journals 
devoted to artificial neural networks, statistical pattern recognition, data analysis 
methods, and multivariate statistical analysis. A nice explanation and comparison of 
several methods (noise injection, sigmoid scaling, target smoothing, penalty terms, 
early stopping and pruning the networks) can be found in [3, 7, 12, 13]. 

However, there exist more factors which in practice participate in determining the 
network’s complexity and which are likely not well known to the connectionist 
community. We present a systematic analysis of all complexity control techniques 
which are used or can be used in SLP and MLP design. One more point of emphasis is that in 
perceptron training several factors act simultaneously, and in order to control the 
network complexity the designer should take all of them into account. We consider the 
question of how the enumerated parameters affect complexity, but do not discuss the 
problem of how to choose a set of their optimal values. 



3 The Number of Iterations 

In non-linear SLP training, with an increase in the number of iterations one can 
obtain seven statistical classifiers of different complexity (EDC, RDA, standard 
Fisher LDF, Pseudo-Fisher LDF, robust DA, minimum empirical error classifier, and 
the maximum margin (support vector) classifier) [9]. Doubtless, the number of 
iterations affects the classifier’s type while training MLP too. 



4 The Weight Decay Term 

Nowadays it is the most popular technique utilized to control the complexity of ANN 
[3]. Addition of the term $\lambda V^T V$ to the standard cost function reduces the magnitudes 
of the weights. For very small weights the activation function acts as a linear one. 
Suppose that $t_1 = -1$ and $t_2 = 1$, that we have the same number of training vectors from each 
class ($N_2 = N_1$), that we add the “weight decay” term $\lambda V^T V$, and that instead of the nonlinear 
activation function we use the linear one, i.e. $f(net) = net$. Then, equating the derivatives 
(4) to zero and solving the resulting equations, we can show that the “weight decay” term 
leads to RDA (see e.g. [11]). An alternative regularization term is $+\lambda (V^T V - c^2)^2$. 
Here a positive parameter $c$ controls the magnitude of the weights and acts as the 
traditional regularizer. 
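
The step from the weight decay term to RDA can be written out as a short derivation. It is a sketch under the assumptions stated in the text (linear activation, targets -1 and 1, equal class sizes) plus one added assumption, that the data are centred so that the class means satisfy $\hat{M}^{(1)} = -\hat{M}^{(2)}$ and the optimal bias is $w_0 = 0$. The symbols $\Delta$ and $S_W$ are introduced here for illustration only.

```latex
\begin{aligned}
&\frac{\partial\,cost}{\partial V}
  = -\frac{2}{N}\sum_{i=1}^{2}\sum_{j=1}^{N_i}\bigl(t_i - V^T X_j^{(i)}\bigr)X_j^{(i)} + 2\lambda V = 0,
  \qquad N = N_1 + N_2,\\[4pt]
&\Bigl(\tfrac{1}{N}\sum_{i,j} X_j^{(i)} X_j^{(i)T} + \lambda I\Bigr)V = \tfrac{1}{2}\,\Delta,
  \qquad \Delta = \hat{M}^{(2)} - \hat{M}^{(1)},\\[4pt]
&\tfrac{1}{N}\sum_{i,j} X_j^{(i)} X_j^{(i)T} = S_W + \tfrac{1}{4}\,\Delta\Delta^T,
  \qquad S_W = \tfrac{1}{N}\sum_{i,j}\bigl(X_j^{(i)} - \hat{M}^{(i)}\bigr)\bigl(X_j^{(i)} - \hat{M}^{(i)}\bigr)^T,\\[4pt]
&V = s\,(S_W + \lambda I)^{-1}\Delta,
  \qquad s = \tfrac{1}{2} - \tfrac{1}{4}\,\Delta^T V
    = \frac{1/2}{1 + \tfrac{1}{4}\,\Delta^T (S_W + \lambda I)^{-1}\Delta} > 0 .
\end{aligned}
```

Up to the positive scalar $s$, the weight vector is the discriminant direction obtained with the “ridge” estimate $S_W + \lambda I$ of the covariance matrix, which is the sense in which weight decay leads to RDA.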



5 The Antiregularization 

After starting from small or even zero values, the weights of the perceptron 
increase gradually [9]. The nonlinear character of the activation function assists in 
obtaining the robust DA, the robust regularized DA, the minimum empirical error and 
the maximum margin classifiers. The magnitudes of the weights, however, depend on the 
separability of the training sets (the empirical classification error). Small weights can 
prevent us from obtaining any of the complex classifiers, such as the robust DA, the 
robust regularized DA, the minimum empirical error and the maximum margin 
classifiers. 

If the training sets overlap, the weights cannot be increased very much. Thus, in 
the case of highly intersecting pattern classes, in order to be able to control the type of the 
classifier obtained at the end, we have to increase the weights artificially. For this 
purpose, we can subtract the weight decay term instead of adding it. Then we obtain 
large weights and begin to minimize the empirical classification error. This technique 
is called antiregularization. It was demonstrated that for non-Gaussian data 
characterized by a number of outliers, the antiregularization technique can reduce the 
generalization error radically. Often the traditional weight decay term destabilizes the 
training process. In this sense, the more complicated regularization term $+\lambda (V^T V - c^2)^2$ 
is preferable. 
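
The effect of the more complicated regularization term can be illustrated in isolation. The sketch below assumes that the term has the form lambda * (V^T V - c^2)^2, so that its gradient is 4 * lambda * (V^T V - c^2) * V, and it minimizes the term alone by gradient descent, without any data. The function names and parameter values are illustrative assumptions.

```python
def penalty(V, lam, c):
    """Regularization term lam * (V^T V - c^2)^2 (assumed form)."""
    sq_norm = sum(v * v for v in V)
    return lam * (sq_norm - c * c) ** 2


def penalty_gradient(V, lam, c):
    """Gradient 4 * lam * (V^T V - c^2) * V of the term above."""
    sq_norm = sum(v * v for v in V)
    return [4.0 * lam * (sq_norm - c * c) * v for v in V]


def descend(V, lam, c, eta, steps):
    """Plain gradient descent on the penalty alone."""
    for _ in range(steps):
        g = penalty_gradient(V, lam, c)
        V = [v - eta * gi for v, gi in zip(V, g)]
    return V


# Small weights are pushed up and large weights are pulled down towards norm c.
small = descend([0.1, 0.0], lam=0.01, c=5.0, eta=0.1, steps=2000)
large = descend([9.0, 0.0], lam=0.01, c=5.0, eta=0.1, steps=2000)
```

Unlike the plain weight decay term, which always shrinks the weights, and its negated form, which always inflates them, this term is bounded from below and drives the weight norm towards the chosen value c from either side.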



6 A Noise Injection 



Noise injection is one more very popular regularization factor. Noise can be added 
to the inputs of the network, to the outputs, to the targets, as well as to the weights of the 
network (see e.g. [2, 3]). 

6.1 A noise Injection to Inputs. In this approach, we add zero mean and small 
covariance noise vectors $n_{j\beta}^{(i)}$ to each training vector during every particular training 
iteration: 

$$X_{j\beta}^{(i)} = X_j^{(i)} + n_{j\beta}^{(i)}, \quad (\beta = 1, 2, \ldots, t_{max}), \tag{5}$$

where $t_{max}$ is the number of training sweeps. 

This technique is also called jittering of the data. While injecting noise $m = t_{max}$ 
times to each training vector we “increase” the number of observations m times. For 
linear classification, we can find a link between noise injection and RDA. Let the 
mean of the noise vector $n_{j\beta}^{(i)}$ be zero and its covariance matrix be $\lambda I$. Then the covariance 
matrix of the random vectors $X_{j\beta}^{(i)}$ ($\beta = 1, 2, \ldots, m$) will be $\Sigma_i + \lambda I$, where $\Sigma_i$ is the true 
covariance matrix of $X_j^{(i)}$. Consider now the sample covariance matrix of a training set 
composed of the vectors $X_{j\beta}^{(i)}$ ($j = 1, 2, \ldots, N_i$; $\beta = 1, 2, \ldots, m$). When $m \to \infty$, the new sample 
estimate of the covariance matrix tends to 

$$\frac{1}{N_i - 1} \sum_{j=1}^{N_i} \left( X_j^{(i)} - \hat{M}^{(i)} \right) \left( X_j^{(i)} - \hat{M}^{(i)} \right)^T + \lambda I, \tag{6}$$

where $\hat{M}^{(i)}$ denotes the sample mean of class $\omega_i$. 

The covariance matrix (6) coincides with the “ridge” (regularized) estimate used in 
RDA. Minimization of the mean square error criterion (3) with the weight decay term 
is equivalent to RDA. Consequently, when $m \to \infty$, the noise injection technique also 
approximates the regularization approach using weight decay. 
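
The link between jittering and the “ridge” estimate can be checked numerically. The following is a minimal sketch with hypothetical two-dimensional data: every training vector is copied m times with spherical Gaussian noise of variance lam added, and the sample covariance of the enlarged set is compared with the original covariance plus lam times the identity. The divisor n is used for both matrices, which is an assumption made here so that the comparison is exact in the limit; the function names are illustrative.

```python
import random


def covariance(points):
    """Sample covariance matrix (divisor n) of a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
             for b in range(d)] for a in range(d)]


def jitter(points, lam, m, rng):
    """Return m noisy copies of every point, with noise drawn from N(0, lam * I)."""
    sd = lam ** 0.5
    return [[xk + rng.gauss(0.0, sd) for xk in p]
            for p in points for _ in range(m)]


rng = random.Random(0)
# Hypothetical training vectors of one class.
train = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 4.0]]
lam = 0.5
S = covariance(train)
S_jittered = covariance(jitter(train, lam, m=4000, rng=rng))
```

For large m the diagonal of `S_jittered` exceeds that of `S` by roughly lam, while the off-diagonal entries stay approximately unchanged.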

Adding noise essentially amounts to introducing new information which 
states that the space between the training vectors is not empty. This is very important in 
MLP design, since the network (more precisely, the cost function to be minimized) 
“does not know” that the pattern space between the training observation vectors is 
not empty. Noise injected into the inputs introduces this information. In practice, it is 
important to choose the right value of $\lambda$, the noise variance. Similarly to RDA and 
smoothing in the Parzen window classifiers, the optimal value of $\lambda$ (the regularization 
parameter or the noise variance) depends upon the number of training vectors and the 
complexity of the problem. 

6.2 A noise Injection to the Weights and to the Outputs of the Network. 
Adding noise to the weights is a common practice to tackle local extrema problems 
in optimization. In addition to fighting the local minima problems, noise injection 
to the weights increases the overlap of the training sets. Consequently, noise 
injection to the weights reduces their magnitudes and acts as a tool to control the 
complexity of the classification rule. In MLP classifier design, noise injection 
into the weights of the hidden layer acts as noise injection into the inputs of the 
following layer. 

During the training process noise can also be added to the outputs of the 
network (or the targets). Such noise injection also reduces the magnitudes of the weights 
and influences the complexity of the classification rule. In all cases, an optimal value of 
the noise variance should be selected. Typically, in situations with unknown data, the 
variance is determined by the cross validation technique. 

A drawback of training with noise injection is that it usually requires 
a small learning-step and many sample presentations, m, over the noise. In 
high-dimensional spaces, this problem is particularly visible. 

6.3 A “Colored” Noise Injection to Inputs [5, 14]. As a rule, one injects 
“white” noise into the inputs. In high dimensional classification problems, data is situated in 
nonlinear subspaces of much lower dimensionality than the formal number of 
features. It means that for large $\lambda$ the spherical noise injection can distort the data 
configuration. In a k-NN directed noise injection, k-NN clustering should be used to 
determine the local intrinsic data dimension; the added noise should be limited to this 
intrinsic dimension. Then we have a minor data distortion, and in comparison with 
spherical noise injection we obtain a gain. An alternative way is to use the k nearest 
neighbors to calculate a singular sample covariance matrix $\hat{\Sigma}_{ij}$ around $X_j^{(i)}$, and then 
to add Gaussian $N(0, \lambda \hat{\Sigma}_{ij})$ noise. Thus, instead of “white” noise we are adding 
“colored” noise. Purposeful transformations of training images and signals performed 
prior to feature extraction are a colored noise injection too. The optimal number of 
nearest neighbors k and the noise variance $\lambda$ should be determined 
experimentally. 
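
The second variant described above, noise drawn from a local covariance matrix estimated on the k nearest neighbors, can be sketched as follows. The sketch assumes that the noise covariance is lam times the local sample covariance (divisor k, taken about the neighbors' mean). It draws such noise as a random combination of the neighbors' deviations, which avoids factorizing a singular matrix. The function names and the toy data are illustrative assumptions and do not reproduce the procedures of [5, 14].

```python
import random


def knn(points, index, k):
    """Indices of the k nearest neighbours of points[index] (Euclidean distance)."""
    x = points[index]

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(points[i], x))

    others = [i for i in range(len(points)) if i != index]
    return sorted(others, key=sq_dist)[:k]


def colored_noise(points, index, k, lam, rng):
    """Draw noise with covariance lam * S, where S is the local sample covariance
    (divisor k) of the k nearest neighbours; S may be singular."""
    neigh = [points[i] for i in knn(points, index, k)]
    d = len(points[index])
    mean = [sum(p[j] for p in neigh) / k for j in range(d)]
    noise = [0.0] * d
    scale = (lam / k) ** 0.5
    for p in neigh:
        g = rng.gauss(0.0, 1.0)           # one Gaussian coefficient per neighbour
        for j in range(d):
            noise[j] += g * scale * (p[j] - mean[j])
    return noise


rng = random.Random(1)
# Points lying on the line x2 = 2 * x1: the intrinsic dimension is 1.
line = [[0.1 * i, 0.2 * i] for i in range(20)]
n = colored_noise(line, index=10, k=4, lam=0.05, rng=rng)
```

Because the neighbors' deviations all lie along the line, the generated noise stays on the line as well, whereas spherical noise would move the point off the data subspace.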



7 Target Values’ Control 

Let us analyze the classification problem with two pattern classes with SLP trained by 
the sum of squares cost function and tanh activation function. Suppose the targets $t_1$ and 
$t_2$ are close to each other, e.g. $t_1 = -0.2$ and $t_2 = 0.2$. Then after minimization of the 
cost function we obtain small values of the sums $G_{ij} = \sum_{\alpha=1}^{n} v_\alpha x_{\alpha j}^{(i)} + w_0$ and, 
consequently, small weights. As a result, the weighted sums $G_{ij}$ vary in a small 
interval around zero. Essentially, the activation function $f(G)$ acts as a linear 
function. Hence, with close targets we will not obtain the robust classification rule, 
the minimum empirical error or the maximal margin classifiers. When $t_1 \to -1$ and 
$t_2 \to 1$ the weights will increase. Then SLP will begin to ignore training vectors distant 
from the decision hyper-plane. Thus, the target values can be utilized as a tool to 
control the “robustness” of the classification rule. To obtain the minimum empirical 
error and the maximal margin classifiers we need to use target values very close to 
the limiting activation function values (-1 and 1 for the tanh activation function). Values 
outside the interval (-1, 1) assist in fast growth of the magnitudes of the weights and speed up 
the training process; however, they increase the probability of being trapped in a local 
minimum. 
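
The relation between the target values and the size of the weighted sums can be made concrete with the inverse of the tanh function. The short sketch below computes the weighted sum at which the output equals the target exactly; the function name and the chosen target values are illustrative.

```python
import math


def required_net(target):
    """Weighted sum G at which tanh(G) equals the target exactly."""
    return math.atanh(target)


# Weighted sums needed to reach a close, an intermediate and a near-limiting target.
nets = {t: required_net(t) for t in (0.2, 0.8, 0.99)}
```

For the target 0.2 the required sum is about 0.2, a region where tanh is almost linear, whereas targets approaching 1 require sums, and hence weights, that grow without bound.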



8 The Learning Step 

The learning-step $\eta$ affects the training speed. Small $\eta$ forces the weights to be small 
for a long time. Thus, while training the non-linear SLP, a small learning step assists 
in obtaining the full sequence of regularized classifiers and the linear discriminant 
function with the conventional or with the pseudo-inverse covariance matrix. A 
large learning step speeds up the growth of the weights; however, it can stop the training 
process just after the first iteration. If one fulfils the conditions E and uses a very large 
$\eta$, then after the first iteration one can obtain enormous weights. The gradient of the 
cost function can become very small, and the training algorithm can stop just after the 
first iteration. Hence, one gets EDC and does not move further. Large and 
intermediate learning step values can stop the training process soon after RDA is 
obtained and do not allow one to obtain RDA with a small regularization parameter. A 
large learning-step leads to large weights immediately. These increase the 
possibility of being trapped in local minima. 

If a constant learning step is utilized, then with an increase in the magnitude of the 
weights the training process slows down and eventually stops. One can say that the effective 
value of $\eta$ decreases. In this case, the fixed value of the learning step can prevent the 
algorithm from reaching the minimum empirical error or the maximum margin classifier. One 
solution to overcome this difficulty is to use a variable learning step $\eta$, where after 
a certain number of iterations $\eta$ is either increased or reduced. In back 
propagation training, in order to obtain the maximal margin classifier it was suggested 
to increase the learning step $\eta$ exponentially, e.g. $\eta = \eta_0 (1 + \varepsilon)^t$, where t is the iteration 
number and $\varepsilon$ is a small positive constant chosen in a trial and error way. A large 
learning step can prevent one from obtaining the standard statistical linear classifiers; 
however, it can allow one to get the robust regularized discriminant function. The degree of 
regularization and the robustness can be controlled by the learning step value. 

Fig. 1. Generalization (solid) and mean square (dots) errors as a function of the number of 
iterations t for different values of $\eta$: 1 - $\eta$ = 0.001, 2 - $\eta$ = 0.1, 3 - $\eta$ = 10, 4 - $\eta$ = 1000. 

Example 1. Above, in Fig. 1, we present four pairs of learning curves: the 
generalization error (solid curves) and the empirical (training set) classification error 
(dots) as functions of the number of iterations t for different $\eta$ values. In this 
experiment, we used two 12-variate Gaussian pattern classes with correlated features 
($\rho$ = 0.2), the Mahalanobis distance $\delta$ = 3.76 (the Bayes error = 0.03), the sigmoid 
activation function, the targets $t_1$ = 0.1, $t_2$ = 0.9, and a training set composed of 10+10 
vectors. Both almost straight lines 4 show that for $\eta$ = 1000 the adaptation 
process practically stops after the very first iteration. We have extremely slow (at the 
beginning) training when the learning-step is too small (curves 1 for $\eta$ = 0.001). In this 
example, the best learning step value is $\eta$ = 0.1, which results in the smallest 
generalization error (curve 2). 
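
The stalling effect of a very large learning step can be reproduced on a toy problem. The sketch below uses a one-input SLP without bias, a tanh activation and targets -0.8 and 0.8, which differs from the setting of Example 1 and is chosen only for brevity. The schedule function assumes the exponential form eta0 * (1 + eps)**t; all names and values are illustrative assumptions.

```python
import math


def grad_v(v, data):
    """Gradient of the sum of squares cost for a one-input SLP without bias."""
    n = len(data)
    total = 0.0
    for x, t in data:
        o = math.tanh(v * x)
        total += -2.0 * (t - o) * (1.0 - o * o) * x / n
    return total


def eta_schedule(eta0, eps, t):
    """Exponentially increasing learning step eta0 * (1 + eps)**t (assumed form)."""
    return eta0 * (1.0 + eps) ** t


data = [(-1.0, -0.8), (1.0, 0.8)]                # (input, target) pairs
after_small = 0.0 - 0.1 * grad_v(0.0, data)      # first step with eta = 0.1
after_large = 0.0 - 1000.0 * grad_v(0.0, data)   # first step with eta = 1000
```

After the large step the weight is so large that tanh saturates and the gradient vanishes, so further updates with a constant step do nothing; after the small step the gradient is still of order one and training continues.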



9 Optimal Values of the Training Parameters 

In SLP training, the “weight decay”, the spherical noise injection, as well as early 
stopping can be equivalent to the classical RDA. In all four complexity control 
methods, we need to choose optimal values of $\lambda$ or the optimal number of iterations. It 
was shown theoretically that in RDA, the optimal value of the regularization 
parameter $\lambda$ depends on N, the training set size: $\lambda_{opt}$ decreases when N increases [11]. 
Unfortunately, the analytical equations were derived for an exactly known data model 
(multivariate Gaussian with known parameters) and for very small $\lambda$ values, and cannot be 
applied in real situations. 

Example 2. In Fig. 2 we present average (over 20 independent experiments) 
generalization error values as functions of $\lambda$: the conventional regularization 
parameter in RDA (a), the regularization parameter in the “weight decay” procedure 
(b), and the noise variance in the “noise injection” regularization (c). In this 
experiment, we used artificial 8-variate Gaussian GCCM data with strongly dependent 
features. 




Fig. 2. Three types of regularization of the SLP classifiers for two sizes of the training sets 
($N_1 = N_2$ = 5 and 10): a - the regularized discriminant analysis with the “ridge” estimate of the 
covariance matrix, b - “weight decay”, c - “noise injection” to inputs. 

For each value of N, the number of training examples, all three pairs of graphs 
exhibit approximately the same generalization error at the minima and illustrate by 
example the equivalence of all 3 regularization techniques. This experiment confirms the 
theoretical conclusion: the optimal values of the regularization parameters decrease 
when the number of training examples increases. In SLP training, the number of 
iterations, the learning step, the targets, noise, the weight decay term, and the non-linearity 
of the activation function act together. The influence of one 
regularization factor is diminished by the other ones. This circumstance causes additional 
difficulties in determining the optimal values of the complexity control parameters. 

The targets, the learning-step and even the number of iterations do not act in a 
visible way. Thus, they can be considered as automatic regularizers. In neural 
network training, however, their effects cannot be ignored. 

The considerations presented in this section are valid for the gradient type search 
techniques used to find the weights. Utilization of more sophisticated optimization 
techniques, such as the second order Newton method or its modifications, can speed up 
the training process, but at the same time introduces definite shortcomings. For 
example, in the case of the linear perceptron, the second order (Newton) method, in 
principle, allows one to minimize the cost function in just one step, but prevents one 
from obtaining RDA. In this sense, the BP algorithm is more desirable. Nevertheless, 
this shortcoming of the Newton method can be diminished by changing the target 
values, or by introducing the weight decay term or a noise injection. 



10 Learning Step in the Hidden Layer of MLP 

In comparison with SLP training, several additional aspects arise in MLP training. The
learning step is the first important implicit factor that affects the complexity. In the
gradient descent optimization procedure, a fixed value of the learning step $\eta$ causes a
random fluctuation of the parameters to be optimized (in our case, the components of
the weight vector). Asymptotically, for fixed $\eta$ and a large number of iterations $t$,
the weight vector oscillates around an optimal weight vector $\hat{V}$ for this
particular training set. In his classical paper [1], Amari has shown that asymptotically,
for large $t$, the weight is a random Gaussian vector $N(\hat{V}, \eta\Sigma_V)$, where $\Sigma_V$ is a
covariance matrix that depends on the data, the configuration of the network and
peculiarities of the cost function near the optimum. In optimization, in order to find
the minimum exactly, the parameter $\eta$ should converge to zero at a certain rate.
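
The dependence of the weight fluctuation on the learning step can be illustrated numerically. The following minimal Python sketch is not the experiment of this paper: it assumes a one-dimensional quadratic cost, Gaussian gradient noise and hypothetical names (`sgd_stationary_variance`, `noise_sd`). It only shows that, with a fixed step, the weight keeps oscillating around the optimum with a variance that grows roughly in proportion to the step.

```python
import random
import statistics

def sgd_stationary_variance(eta, steps=20000, burn_in=2000, noise_sd=1.0, seed=0):
    """Constant-step SGD on the toy cost 0.5 * (w - w_opt)**2 with noisy gradients.

    Returns the sample variance of the weight after the burn-in period.
    The cost, the noise model and all names are illustrative assumptions.
    """
    rng = random.Random(seed)
    w_opt = 1.0
    w = 0.0
    trace = []
    for t in range(steps):
        grad = (w - w_opt) + rng.gauss(0.0, noise_sd)  # noisy gradient estimate
        w -= eta * grad                                # fixed learning step
        if t >= burn_in:
            trace.append(w)
    return statistics.pvariance(trace)

var_small = sgd_stationary_variance(eta=0.01)
var_large = sgd_stationary_variance(eta=0.1)
print(var_small, var_large)
```

For this toy cost the stationary variance is $\eta\sigma^2/(2-\eta)$, where $\sigma$ is the standard deviation of the gradient noise, so a tenfold larger step gives a roughly tenfold larger fluctuation of the weight.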

The random character of the hidden units’ weight vectors suggests that the weights
and outputs of the hidden layer neurons are random variables. Thus, we have a chaos
(a process noise) inside the feed-forward neural network. Random outputs serve as
random inputs to the single layer perceptrons of the output layer. Consequently, in the
output layer, we have data jittering. The noise variance depends on the value of the
learning step $\eta_h$ used to adapt the hidden units.

Hence, in MLP training, the learning step $\eta_h$ in the hidden layer, the traditional
noise injection, the weight decay term and the number of iterations
produce similar effects. Here, we recall once more that when the hidden units’
weights become large, the nonlinear character of the activation function flattens the
minima of the cost function, reduces the gradient and diminishes the variance of the
noise injected into the output layer. Hence, we have an automatic reduction of the
process noise variance.

Example 3. The effects of regularization of an MLP (with 8 inputs, 8 hidden
neurons and one output) performed by three different, but at the same time similar,
techniques are illustrated in Fig. 3: weight decay, jittering of the training data, and
“jittering” of the inputs of the output layer controlled by $\eta_h$, the learning step of the
hidden neurons (eta hidden). In this experiment, we use artificial non-Gaussian data
where vectors of the first class are inside the second class. The training set size was
$N = 140$ (70 observation vectors from each class), and the test set size was
$N_t = 1000 = 500 + 500$. Fig. 3 shows average values of the generalization error
estimated from 7 experiments.

Fig. 3. Three types of regularization of the MLP classifier: a - weight decay, b - uniform
noise injection into the inputs, c - learning step of the hidden layer neurons.

This experiment confirms the theoretical conclusion that the learning step $\eta_h$ used
to train the hidden layer neurons has a similar effect to weight decay and noise
injection. The parameter can be used to control the classifier’s complexity. At the
same time, the learning step is an obligatory parameter in BP training. Hence, the
learning step $\eta_h$ acts in an implicit way and affects the influence of other regularization
tools such as the weight decay parameter, the variance of the noise added to the inputs, the
target values and others.



11 Sigmoid Scaling 



The SLP classifier performs classification of an unknown vector $X$ according to the sign
of the discriminant function $g(X) = v_1 x_1 + v_2 x_2 + \dots + v_n x_n + v_0$. Multiplication of all
weights, $v_1, v_2, \dots, v_n, v_0$, by a positive scalar $\alpha$ does not change the decision
boundary. In MLP classifiers, however, multiplication of the
hidden layer weights has more important consequences. A proportional increase in the
magnitude of the hidden units’ weights forms a decision boundary with sharp angles.
A proportional decrease in the magnitudes of all hidden layer weights smoothes the
sharp angles and changes the complexity of the MLP classifier. Thus, control of the
magnitudes of the hidden layer weights is one of the possible techniques which could be
utilized to determine the classifier’s complexity.
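
The contrast between the two cases can be checked on a toy example. The Python sketch below is only an illustration under stated assumptions: the weights are made up, the hidden units use `tanh` (the paper does not fix the activation here), and the function names are hypothetical. It shows that scaling all SLP weights leaves the decision unchanged, whereas scaling only the hidden layer weights of a small MLP moves its decision boundary.

```python
import math

def slp_decision(x, v, v0):
    """Sign of the linear discriminant g(x) = v . x + v0."""
    g = sum(vi * xi for vi, xi in zip(v, x)) + v0
    return 1 if g > 0 else -1

def mlp_output(x, hidden_w, hidden_b, out_w, out_b):
    """One-hidden-layer perceptron: tanh hidden units, linear output (assumed)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(hidden_w, hidden_b)]
    return sum(w * hi for w, hi in zip(out_w, h)) + out_b

def scale_hidden(hidden_w, hidden_b, alpha):
    """Multiply all hidden layer weights and biases by alpha."""
    return [[alpha * w for w in ws] for ws in hidden_w], [alpha * b for b in hidden_b]

alpha = 0.35
points = [[1.0, 2.0], [-0.5, 0.3], [2.0, -3.0]]
v, v0 = [0.8, -0.4], 0.1

# SLP: scaling every weight by a positive alpha keeps every decision.
slp_same = all(
    slp_decision(x, v, v0) == slp_decision(x, [alpha * vi for vi in v], alpha * v0)
    for x in points
)

# MLP: one hidden unit, output bias -0.5; the boundary sits at tanh(alpha * x) = 0.5.
hw, hb, ow, ob = [[1.0]], [0.0], [1.0], -0.5
before = mlp_output([1.0], hw, hb, ow, ob)
hw_s, hb_s = scale_hidden(hw, hb, alpha)
after = mlp_output([1.0], hw_s, hb_s, ow, ob)
print(slp_same, before, after)
```

In this toy MLP the point x = 1 lies on the positive side of the boundary before scaling and on the negative side after scaling, because the boundary moves from atanh(0.5) to atanh(0.5)/α.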

Example 4. In Fig. 4 we have 15 + 15 bi-variate training vectors from two
artificially generated pattern classes where vectors from the first class are inside the
second one and there is a ring-shaped margin between the pattern classes. An optimal
(Bayes) decision boundary is depicted as a bold circle 3 in Fig. 4 a.











Fig. 4. a: 1 - decision boundary of the overtrained MLP, 2 - decision boundary of the
optimally stopped MLP smoothed by the factor $\alpha = 0.53$, 3 - an optimal (Bayes)
decision boundary; b: the generalization error as a function of the scaling parameter
$\alpha$: 1 - scaling of the hidden layer weights of the overtrained perceptron, 2 - scaling of
the optimally stopped perceptron.

After training an MLP with 10 hidden units for 10000 iterations in batch mode, we
obtained a nonlinear decision boundary 1 with 7.6 % generalization error (for
estimation we used 500+500 test vectors). In this experiment, we observed
significant overtraining: optimal stopping resulted in a generalization error that was
two times smaller (3.4 %). The proportional reduction of all hidden layer weights by the factor
$\alpha = 0.35$ smoothed the decision boundary and diminished the generalization error to
1.5 % (see the behavior of curve 1 in Fig. 4 b). A notably better smooth decision boundary
(curve 2 in Fig. 4 a) was obtained after smoothing the decision boundary of the
optimally stopped MLP: the generalization error was reduced from 3.4 %
to 0.09 % (curve 2 in Fig. 4 b with a minimum at $\alpha = 0.53$). Thus, in order to
reduce the generalization error we had to optimize both the number of iterations $t$ and
the weight scaling parameter $\alpha$.

As a possible generalization of the weight scaling technique, the magnitudes of the
weights can be controlled individually for each hidden unit. An alternative to the
scaling of the hidden layer weights by the factor $\alpha$ is the supplementary regularization
term $\lambda_{hidden}(V_{hidden}^{T} V_{hidden} - h^{2})^{2}$ added to the cost function. Here the parameter $h$ controls the
magnitudes of the hidden layer weights $V_{hidden}$.

In conclusion, we summarize that the statistical hypotheses about the data structure
utilized to transform the data prior to training the perceptron strongly affect the weight
initialization, the complexity and the performance of the SLP and MLP classifiers. All
parameters of the cost function and the optimization procedure act together and
jointly affect the complexity. While seeking an optimal value of one of them, the
researcher should keep in mind the values of the other complexity control parameters.

Abstract. We consider the problem and issues of classifier fusion and discuss
how they should be reflected in the fusion system architecture. We adopt the 
Bayesian viewpoint and show how this leads to classifier output moderation to 
compensate for sampling problems. We then discuss how the moderated outputs 
should be combined to reflect the prior distribution of models underlying the 
classifier designs. We then elaborate how the final stage of fusion should combine 
the complementary measurement information that might be available to different 
experts. This process is embodied in an overall architecture which shows why the 
fusion of raw expert outputs is a nonlinear function of the expert outputs and how 
this function can be realised as a sequence of relatively simple processes. 



1 Introduction 

In the two decades since the publication of the Devijver-Kittler text [1], Statistical Pattern
Recognition has made significant advances. The following brief review of the progress
made, serving as an introduction to the main part of this paper, is biased and idiosyncratic,
presented merely to motivate the main discussion. For a detailed account of the
achievements of the last twenty years the reader is referred to a recent review by Jain,
Duin and Mao [2].

In statistical pattern classification the most notable progress has been made in the
area of modelling probability density functions using a mixture of simple components,
predominantly Gaussians. The general approach is discussed in detail in [?]. Some
interesting developments on the theme include the joint modelling of the background and
class specific contributions to the mixture model [10], which provides useful information
from the point of view of classifier design. One of the most important issues in
modelling pdfs by a mixture of components is architecture selection. The problem is that the
usual goodness of fit criteria (Kullback-Leibler measure, maximum likelihood)
monotonically improve as the number of components increases. The best architecture therefore
has to be selected using alternative measures. One possibility is to check the recognition
performance achieved with the resulting model by cross validation using another set of
observations. However, this does not guarantee that the model is necessarily good, not
to mention that cross validation is costly in terms of the amount of data that has
to be available. A more promising idea that has received a lot of attention is based on
the minimum complexity principle (Ockham’s razor). Accordingly, the simplest model
that explains the data is the best. This has led to the use of penalised goodness of fit
measures. The simplest form, proposed by Akaike, which imposed a prior distribution
over all the possible models, was refined by a number of researchers making the penalty
term dependent on the size of the training set [?], dimensionality and the degree of data
correlation [?]. However, while these methods protect from overfitting, they do not
guarantee that the model will not be underfitted. In this respect a more promising approach
to architecture selection is based on the idea of model validation recently proposed in [16].

Over the period the state of the art of the methodology for classifier design has
been pushed significantly by other research communities. Notable advances have been
made in the area of decision trees [?] and in neural networks [?]. The most recent
development in machine learning, Support Vector Machines, is particularly exciting and
stimulating [8].

The last decade has also witnessed considerable advances in feature selection. The
popular method of optimising feature selection criteria, the plus-l, take-away-r
algorithm, has been enhanced by making the numbers of forward and backward search steps,
l and r, dynamic (data dependent). This computationally efficient algorithm [9], which
is known as Floating search, has been found [12] to be the most effective suboptimal
search method. Its recent further enhancements are the adaptive floating search method
[?] and the oscillating search method [?].

The classical Branch and Bound search algorithm has been accelerated [?] by
profiling the effect on the criterion function value of feature knock-outs. The observed
profile can be used to make look ahead predictions which in turn are useful in guiding the
search process into the most promising part of the search tree [15]. It has also been shown
that feature selection can be performed as part of the process of data modelling using
Gaussian mixtures [?]. The effectiveness of evolutionary optimisation approaches in
feature selection has been demonstrated in [18,17,19]. The use of fused classifier error
as a criterion for feature selection has been suggested in [20,21].

Another area where significant advances have been made is classification in context.
Conventional pattern classification involves a single object. However, objects usually do
not exist in isolation. Other objects (neighbouring or otherwise) may convey contextual
information that can be exploited in decision making. The motivation to incorporate
contextual information systematically in the decision process led to the development
of techniques which have close affinity to structural pattern recognition methods.
Depending on whether the classification problem is formulated as message centered (joint
labelling of all the objects) or object centered (labelling of a single object using context),
the classification problem leads to either graph matching or probabilistic relaxation. The
Bayesian framework for the latter has been developed in [?] and extended to handle
relational information in [32].

Last but not least, one of the most exciting directions of the last ten years has been
classifier fusion. It has been recognised for some time that the classical approach to
designing a pattern recognition system, which focuses on finding the best classifier, has a
serious drawback. Any complementary discriminatory information that other classifiers
may encapsulate is not tapped. Multiple expert fusion aims to make use of many different
designs to improve the classification performance. Over the last few years a myriad of
methods for fusing the outputs of multiple classifiers have been proposed
[34,35,36,37,38,39,40,41,42,43,44]. The methods range from simple Bayesian estimation methods,
through trainable multistage strategies where the outputs of component classifiers are
considered as features and the fusion is performed by another classifier designed using
independent data, to data dependent methods where each classifier has a domain of
superior competence and its opinion is called on only when the observation falls into
this domain.

Notwithstanding all these advances it is pertinent to question their significance in
view of the developments in Support Vector Machines [?]. This novel approach to
training classifiers by minimising the structural risk enables the designer to position the
class separating boundaries carefully so as to reduce the chances of misclassifying new
patterns. The optimal positioning of the boundaries can be achieved for a given training
set in the space of any dimensionality without feature selection. Surprisingly, it would
appear that one can generate an arbitrary number of additional dimensions (features)
without risking overfitting. This facilitates the construction of more effective, nonlinear
boundaries between classes without compromising the ability to generalise. One could
then make the inference that the design methodology should, at least in principle, lead
to classifiers that capture all the discriminatory information in a single design. Seeking
and fusing multiple opinions should thus be unnecessary.

The above argument, in one bold sweep, would make the classical pattern recognition
system model and all the achievements of the last two decades out of date. When
put to the test in the context of an application concerned with personal identity
verification [?], the results were interesting, but not one sided. The problem was to verify
the claimed identity of probe face images and for this purpose we used SVMs, simple
Euclidean distance and normalised correlation classifiers. The experiments were
performed in eigenface and fisherface spaces under various photometric normalisations.
Interestingly, SVMs were able to extract the relevant discriminatory information even from
the eigenface representation, regardless of the sophistication of the photometric
preprocessing or lack of it. However, once the discriminatory information was extracted by the
traditional means of discriminant analysis, and the data suitably normalised, very simple
classifiers outperformed the powerful SVMs. In fact SVMs did not benefit from these
preprocessing steps at all.

There are three important conclusions that can be drawn from the above findings: 



- SVMs are very effective in extracting discriminatory information from any
representation and can successfully cope with complex intraclass variations

- SVM designs are not guaranteed to be superior to other carefully designed classifiers
and therefore one can argue that fusion is still relevant

- The use of knowledge about the problem domain and the data for the architecture
selection is crucial to achieving successful designs



With the rationale for continued interest in classifier fusion re-established it is
pertinent to ask whether classifier fusion systems should be designed using powerful machine
learning methods, or whether careful consideration should be given to the issue of fusion
architecture selection in view of the third conclusion. In this paper we advocate the
latter. This point has already been argued in the context of neural net classifier design in
[?]. In this paper we consider the problem and issues of fusion and discuss how they
should be reflected in the fusion system architecture. We adopt the Bayesian viewpoint
and show how this leads to classifier output moderation to compensate for sampling
problems. We then discuss how the moderated outputs should be combined to reflect
the prior distribution of the models underlying the classifier designs. The final stage of
fusion combines the complementary measurement information that may be available to
different experts. This process is embodied in an overall architecture which shows why
the fusion of raw expert outputs is a nonlinear function and how this function can be
realised as a sequence of relatively simple processes.

The paper is organised as follows. In the next section we introduce the theoretical
model underpinning classifier fusion. The effect of averaging over classifier models
is discussed in Section 3. Classifier output moderation is the subject of discussion in
Section 4. The resulting architecture is described in Section 5. Section 6 draws the paper
to conclusion.



2 Theoretical Framework 



Consider a pattern recognition problem where pattern $Z$ is to be assigned to one of the
$m$ possible classes $\{\omega_1, \dots, \omega_m\}$. Let us assume that we make $R$ vector observations
$x_i$, $i = 1, \dots, R$, on the given pattern and that the $i$-th measurement vector is the input to the
$i$-th expert modality. We shall assume that these observations are provided by different
sensors, or perhaps by the same sensor over a period of time. The different sensors can
be either physical or logical. When the measurements are acquired by different physical
sensors it is reasonable to assume that they will be conditionally statistically independent.
However, for logical sensors we may not be able to make such a strong assumption.
Indeed, it may often be the case that some of the components of the measurement
vectors will be highly correlated or even identical copies. This could happen if the
measurement vectors $x_i$, $i = 1, \dots, R$, are formed from a larger pool of features by
selection with replacement, which makes the features available for other classifier input
vectors. This construction could result in some of the features being shared. Logical
sensors, of course, could also generate features that are weakly correlated. However,
we shall not consider all the possible scenarios. Instead, we shall make the assumption
that the components of one vector are either statistically conditionally independent from
those of another, or they are exact replicas.

In principle, vectors $x_i$ could share different numbers of features. However, once
again we shall not consider such complications as this would make our notation
unnecessarily complex. The analysis to be presented could easily be extended to any specific
practical situation, if desired. Thus for the sake of simplicity, and without any loss of
generality, we shall assume that the components of each pattern vector $x_i$ can be
divided into two groups, forming vectors $y$ and $\xi_i$, i.e. $x_i = [y^T, \xi_i^T]^T$, where the vector
of measurements $y$ is shared by all the $R$ modalities whereas $\xi_i$ is specific to the $i$-th
modality. We shall assume that given a class identity, the modality specific part $\xi_i$ of the
pattern representation is conditionally independent from $\xi_j$, $j \neq i$.

In the measurement space each class $\omega_k$ is modelled by the probability density
function $p(x_i \mid \omega_k)$ and its a priori probability of occurrence is denoted by $P(\omega_k)$. We
shall consider the models to be mutually exclusive which means that only one model
can be associated with each pattern.

Now according to the Bayesian theory, given measurements $x_i$, $i = 1, \dots, R$, the
pattern $Z$ should be assigned to class $\omega_j$, i.e. its label $\theta$ should assume value $\theta = \omega_j$,
provided the aposteriori probability of that interpretation is maximum, i.e.

assign $\theta \rightarrow \omega_j$ if $P(\theta = \omega_j \mid x_1, \dots, x_R) = \max_k P(\theta = \omega_k \mid x_1, \dots, x_R)$ (1)



Let us rewrite the aposteriori probability $P(\theta = \omega_k \mid x_1, \dots, x_R)$ using the Bayes
theorem. We have

$$P(\theta = \omega_k \mid x_1, \dots, x_R) = \frac{p(x_1, \dots, x_R \mid \theta = \omega_k)\, P(\omega_k)}{p(x_1, \dots, x_R)} \qquad (2)$$

where $p(x_1, \dots, x_R \mid \theta = \omega_k)$ and $p(x_1, \dots, x_R)$ are the conditional and unconditional
measurement joint probability densities. The latter can be expressed in terms of the
conditional measurement distributions as
$p(x_1, \dots, x_R) = \sum_{j=1}^{m} p(x_1, \dots, x_R \mid \theta = \omega_j)\, P(\omega_j)$
and therefore, in the following, we can concentrate only on the numerator
terms of (2).

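We commence by expressing $p(x_1, \dots, x_R \mid \theta = \omega_k)$ as

$$p(x_1, \dots, x_R \mid \theta = \omega_k) = p(\xi_1, \dots, \xi_R \mid y, \theta = \omega_k)\, p(y \mid \theta = \omega_k) \qquad (3)$$

Recalling our assumption that the modality specific representations $\xi_i$, $i = 1, \dots, R$, are
conditionally statistically independent, we can write

$$p(x_1, \dots, x_R \mid \theta = \omega_k) = \Bigl[\prod_{i=1}^{R} p(\xi_i \mid y, \theta = \omega_k)\Bigr]\, p(y \mid \theta = \omega_k) \qquad (4)$$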



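which can further be expressed as

$$p(x_1, \dots, x_R \mid \theta = \omega_k) = \Bigl[\prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid y, \xi_i)\, p(y, \xi_i)}{P(\omega_k \mid y)\, p(y)}\Bigr] \cdot \frac{P(\omega_k \mid y)\, p(y)}{P(\omega_k)} \qquad (5)$$

and finally

$$p(x_1, \dots, x_R \mid \theta = \omega_k) = \Bigl[\prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid x_i)\, p(x_i)}{P(\omega_k \mid y)\, p(y)}\Bigr] \cdot \frac{P(\omega_k \mid y)\, p(y)}{P(\omega_k)} \qquad (6)$$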



Let us pause to look at the meaning of the terms defining $p(x_1, \dots, x_R \mid \theta = \omega_k)$.
First of all, $P(\theta = \omega_k \mid x_i)$ is the $k$-th class aposteriori probability computed by each
of the $R$ classifiers, whereas $P(\omega_k \mid y)$ is the $k$-th class probability based on the shared
features. $p(x_i)$ and $p(y)$ are the mixture measurement densities of the representations
used for decision making by each of the experts. Since the measurement densities are
independent of the class labels they can be cancelled out by the normalising term in the
expression for the aposteriori probability in (2) and we obtain the decision rule

assign $\theta \rightarrow \omega_j$ if $\Bigl[\prod_{i=1}^{R} \frac{P(\theta = \omega_j \mid x_i)}{P(\theta = \omega_j \mid y)}\Bigr] P(\theta = \omega_j \mid y) = \max_k \Bigl[\prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid x_i)}{P(\theta = \omega_k \mid y)}\Bigr] P(\theta = \omega_k \mid y)$ (7)

which combines the individual classifier outputs in terms of a product. Each factor in
the product for class $\omega_k$ is normalised by the aposteriori probability of the class given
the shared representation.
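
The product rule just described can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used by the author: the function names (`product_rule_scores`, `decide`) and the numerical posteriors are assumptions made up for the example. For each class, every expert posterior is divided by the posterior given the shared features, the ratios are multiplied, and the result is weighted by the shared-feature posterior.

```python
from math import prod

def product_rule_scores(expert_posteriors, shared_posteriors):
    """Class scores of the product fusion rule.

    expert_posteriors[i][k] : P(omega_k | x_i) delivered by expert i
    shared_posteriors[k]    : P(omega_k | y) based on the shared features
    Returns P(omega_k | y) * prod_i [P(omega_k | x_i) / P(omega_k | y)] per class.
    """
    scores = []
    for k, p_shared in enumerate(shared_posteriors):
        ratios = [p[k] / p_shared for p in expert_posteriors]
        scores.append(p_shared * prod(ratios))
    return scores

def decide(scores):
    """Index of the class with the maximum score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical outputs of R = 3 experts for m = 2 classes.
experts = [[0.6, 0.4], [0.7, 0.3], [0.4, 0.6]]
shared = [0.5, 0.5]
scores = product_rule_scores(experts, shared)
winner = decide(scores)
print(scores, winner)
```

With these made-up numbers the first class wins, even though the third expert on its own prefers the second class.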

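Now let us consider the ratio $P(\theta = \omega_k \mid x_i) / P(\theta = \omega_k \mid y)$ and suppose it is close to one. We can
then write $P(\theta = \omega_k \mid x_i) = P(\omega_k \mid y)(1 + \Delta_{ki})$, where $\Delta_{ki}$ is small. Substituting into (7) and linearising the
product by expanding it and neglecting all terms of second order and higher, the decision
rule becomes

assign $\theta \rightarrow \omega_j$ if $(1 - R)\, P(\theta = \omega_j \mid y) + \sum_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_k \Bigl[(1 - R)\, P(\theta = \omega_k \mid y) + \sum_{i=1}^{R} P(\theta = \omega_k \mid x_i)\Bigr]$ (8)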
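
The linearisation step can be written out explicitly. The following short derivation is a worked restatement of the argument above, using only the quantities already defined; it shows how the product over experts turns into a sum once the second and higher order terms in $\Delta_{ki}$ are dropped.

```latex
\begin{align*}
P(\theta=\omega_k \mid y)\prod_{i=1}^{R}\frac{P(\theta=\omega_k \mid x_i)}{P(\theta=\omega_k \mid y)}
 &= P(\theta=\omega_k \mid y)\prod_{i=1}^{R}\bigl(1+\Delta_{ki}\bigr) \\
 &\approx P(\theta=\omega_k \mid y)\Bigl(1+\sum_{i=1}^{R}\Delta_{ki}\Bigr) \\
 &= P(\theta=\omega_k \mid y)
    +\sum_{i=1}^{R}\bigl[P(\theta=\omega_k \mid x_i)-P(\theta=\omega_k \mid y)\bigr] \\
 &= (1-R)\,P(\theta=\omega_k \mid y)+\sum_{i=1}^{R}P(\theta=\omega_k \mid x_i).
\end{align*}
```

The third line uses $P(\theta=\omega_k \mid y)\,\Delta_{ki} = P(\theta=\omega_k \mid x_i) - P(\theta=\omega_k \mid y)$, which follows directly from the definition of $\Delta_{ki}$.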



3 Averaging over Models 

The probability level fusion embodied by rules (7) and (8) involves the true aposteriori
class probabilities. In practice these functions will be estimated by our experts and will
be subject to errors. The design of the experts will normally be based on training data. In
general we may have different training sets for each of the sensors. This will especially
be true in the case when the different sensors are physical. A typical example here is the
problem of person identification using biometrics involving voice characteristics, frontal
face images, face profile, lip dynamics, etc. In all these cases the training data sets will
be completely different.

Let us consider one such sensor, say sensor $i$, for which the available training set
is $\mathcal{X}_i$. An estimate $\hat{P}(\omega \mid x_i)$ of $P(\omega \mid x_i)$ that an expert can deliver will be influenced
by the training set $\mathcal{X}_i$. We shall represent this explicitly by writing for the estimate
$\hat{P}(\omega \mid x_i) = \hat{P}(\omega \mid x_i, \mathcal{X}_i)$. The design of each expert will involve the choice of a model,
$M$, and for each model we shall estimate its parameters represented by vector $\gamma_i$. Hence
an actual estimate will be conditioned on these two factors as $\hat{P}(\omega \mid x_i, \mathcal{X}_i, M, \gamma_i)$. It
follows that if we wish to obtain as good an estimate as possible, we should

- consider as many models as possible, and

- minimise the influence of a particular choice of model parameters.

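Mathematically this can be achieved by averaging over all the possible models and their
parameters. This can be written as

$$\hat{P}(\omega \mid x_i) = \int\!\!\int \hat{P}(\omega \mid x_i, \mathcal{X}_i, M, \gamma_i)\, p(M)\, p(\gamma_i)\, dM\, d\gamma_i \qquad (9)$$

where $p(M)$ and $p(\gamma_i)$ are the distributions of models and model parameters respectively.
In (9) the integration over the parameter space is referred to as moderation and it will
be discussed in detail in Section 4. The integral over the model space can be estimated
as

$$\hat{P}(\omega \mid x_i) = \frac{1}{N} \sum_{j=1}^{N} \int \hat{P}(\omega \mid x_i, \mathcal{X}_i, M_j, \gamma_i)\, p(\gamma_i)\, d\gamma_i \qquad (10)$$

where $N$ is the number of models considered. Denoting $\hat{P}(\omega \mid x_i, \mathcal{X}_i, M_j, \gamma_i) = \hat{P}_j(\omega \mid x_i)$ and its integral over $\gamma_i$ as

$$\tilde{P}_j(\omega \mid x_i) = \int \hat{P}_j(\omega \mid x_i)\, p(\gamma_i)\, d\gamma_i \qquad (11)$$

we finally obtain an estimate of the aposteriori class probability based on sensor $i$ as

$$\hat{P}(\omega \mid x_i) = \frac{1}{N} \sum_{j=1}^{N} \tilde{P}_j(\omega \mid x_i) \qquad (12)$$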
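
The averaging step of Eq. (12) is simple enough to sketch in code. The Python fragment below is a minimal illustration under stated assumptions: the function name `average_over_models` and the numerical outputs are hypothetical, and the optional `weights` argument is an extension (with equal weights it reduces to the plain average over models).

```python
def average_over_models(moderated_outputs, weights=None):
    """Average the moderated class posteriors of several models for one sensor.

    moderated_outputs[j][k] : moderated posterior of class k from model j
    weights                 : optional non-negative weight per model (assumed
                              extension; equal weights give the plain average)
    """
    n_models = len(moderated_outputs)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_classes = len(moderated_outputs[0])
    return [
        sum(w * out[k] for w, out in zip(weights, moderated_outputs)) / total
        for k in range(n_classes)
    ]

# Hypothetical moderated outputs of three models for a two-class problem.
outputs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
averaged = average_over_models(outputs)
print(averaged)
```

Because each model output sums to one over the classes, the averaged output also sums to one, so no renormalisation is needed at this stage.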



4 Expert Output Moderation 

In Section 3 we argued for a moderation of raw expert outputs. The moderation is
warranted for pragmatic reasons, namely to minimise the veto effect of overconfident
erroneous classifiers. However, as has been pointed out in [?], there is also a good
theoretical basis for moderating expert estimates of class aposteriori probabilities. The
argument goes as follows. The estimation of the aposteriori class probability $\hat{P}_j(\omega \mid x_i)$
by the $j$-th expert using the output $x_i$ of sensor $i$ is dependent on the training set $\mathcal{X}_i$
of data collected from sensor $i$. For a particular model which underlies the design of
expert $j$ the training data is used to estimate the model parameters $\gamma_{ij}$. Note however,
that the estimated parameters cannot be considered as the true parameters of the assumed
model. By taking into account the distribution of the model parameter vector it should
be possible to obtain a more conservative estimate of the decision output which will
reduce the risk of overfitting to a particular training set.

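Mathematically, expert output $\tilde{P}_j(\omega \mid x_i)$ is derived by integrating parameter
dependent estimates $\hat{P}_j(\omega \mid x_i) = \hat{P}_j(\omega \mid x_i, \gamma_{ij})$ over the model parameter space as

$$\tilde{P}_j(\omega \mid x_i) = \int \hat{P}_j(\omega \mid x_i, \gamma_{ij})\, p(\gamma_{ij})\, d\gamma_{ij} \qquad (13)$$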

Under the assumption that in the observation space the classes are distributed normally,
the moderation converts the Gaussian density into Student’s t distribution. Bedworth
[26] used this result to derive moderated posterior class probabilities for multilevel fusion
and showed, on the standard UCI repository of classification problems, that for small
training sample sets the results obtained by moderation are superior to non moderated
expert output fusion.
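
The effect of replacing a plug-in Gaussian by a Student's t density can be seen on a small sample. The Python sketch below is an illustration under explicit assumptions, not the procedure of the cited work: it treats a single one-dimensional class, assumes the standard noninformative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$ on the unknown mean and variance, and uses hypothetical names and data. Under that prior the predictive density of a new observation is a Student's t with $n - 1$ degrees of freedom, location equal to the sample mean and scale $s\sqrt{1 + 1/n}$.

```python
import math
import statistics

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def student_t_pdf(x, dof, loc, scale):
    """Density of a location-scale Student's t distribution."""
    z = (x - loc) / scale
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi) - math.log(scale))
    return math.exp(log_c - (dof + 1) / 2 * math.log1p(z * z / dof))

def plug_in_and_moderated(sample):
    """Return the plug-in Gaussian density and the moderated (predictive) density."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # estimate with the (n - 1) divisor
    plug_in = lambda x: normal_pdf(x, mean, sd)
    moderated = lambda x: student_t_pdf(x, n - 1, mean, sd * math.sqrt(1 + 1 / n))
    return plug_in, moderated

sample = [4.8, 5.1, 5.0, 4.9, 5.2]          # made-up training sample of size 5
plug_in, moderated = plug_in_and_moderated(sample)
print(plug_in(5.0), moderated(5.0))          # near the mean
print(plug_in(6.0), moderated(6.0))          # far in the tail
```

For this tiny sample the moderated density is lower at the sample mean and far higher in the tail, which is the conservative behaviour that moderation is meant to provide; as the sample grows the two densities become indistinguishable.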

It is perhaps true to say that a Student’s t distribution converges to the corresponding
Gaussian quite rapidly and for training sets of reasonable size there should not be any
appreciable difference between moderated and raw expert outputs. However, for some
types of classifiers, moderation is pertinent even for sample sets of respectable size. An
important case is the $k$-Nearest Neighbour ($k$-NN) classifier. Even if the training set is
relatively large, say hundreds of samples or more, the need for moderation is determined
by the value of $k$, which may be as low as $k = 1$. Considering just the simplest case,
a two class problem, it is perfectly possible to draw all $k$ nearest neighbours from the
same class, which means that one of the classes will have the expert output set to zero. In
the subsequent (product) fusion this will then dominate the fused output and may impose
a veto on the class even if other experts are supportive of that particular hypothesis.
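
The veto effect is easy to reproduce. The Python sketch below is a toy illustration with made-up numbers and hypothetical names; for brevity it fuses the experts with a plain product of their posteriors, i.e. it assumes there are no shared features to normalise by. One 3-NN expert finds all its neighbours in class "a", and its zero output for class "b" overrides two experts that strongly favour "b".

```python
from math import prod

def knn_raw_posteriors(neighbour_labels, classes):
    """Raw k-NN estimate: fraction of the k nearest neighbours in each class."""
    k = len(neighbour_labels)
    return [neighbour_labels.count(c) / k for c in classes]

classes = ["a", "b"]
experts = [
    knn_raw_posteriors(["a", "a", "a"], classes),  # all 3 neighbours from class "a"
    [0.1, 0.9],                                    # two other experts favour "b"
    [0.2, 0.8],
]

# Plain product fusion (assumption: no shared-feature normalisation).
fused = [prod(e[k] for e in experts) for k in range(len(classes))]
print(fused)
```

The fused score of class "b" is exactly zero, so no amount of support from the remaining experts can recover it.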

We shall now consider this situation in more detail. Suppose that we draw $k$ nearest
neighbours and find that $\kappa$ of these belong to class $\omega$. Then the unbiased estimate
$\hat{P}_j(\omega \mid x_i)$ of the aposteriori probability $P(\omega \mid x_i)$ is given by

$$\hat{P}_j(\omega \mid x_i) = \frac{\kappa}{k} \qquad (14)$$

It should be noted that the actual observation $\kappa$ out of $k$ could arise for any value of
$P(\omega \mid x_i)$ with the probability

$$q(\kappa) = \binom{k}{\kappa} P^{\kappa}(\omega \mid x_i)\, [1 - P(\omega \mid x_i)]^{k - \kappa} \qquad (15)$$

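Assuming that a priori the probability $P(\omega \mid x_i)$ taking any value between zero and one
is equally likely, we can find an aposteriori estimate of the aposteriori class probability
$P(\omega \mid x_i)$ as

$$\tilde{P}_j(\omega \mid x_i) = \frac{\int_0^1 P(\omega \mid x_i)\, P^{\kappa}(\omega \mid x_i)\, [1 - P(\omega \mid x_i)]^{k - \kappa}\, dP(\omega \mid x_i)}{\int_0^1 P^{\kappa}(\omega \mid x_i)\, [1 - P(\omega \mid x_i)]^{k - \kappa}\, dP(\omega \mid x_i)} \qquad (16)$$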

where the denominator is a normalising factor ensuring that the total probability mass
equals one. By expanding the term $[1 - P(\omega \mid x_i)]^{k - \kappa}$ and integrating, it can be easily
verified that the right hand side of (16) becomes

$$\tilde{P}_j(\omega \mid x_i) = \frac{\kappa + 1}{k + 2} \qquad (17)$$

which is the mean of the corresponding beta distribution. Thus the moderated equivalent
of $\kappa / k$ is $(\kappa + 1)/(k + 2)$. Clearly our estimates of aposteriori class probabilities will
never reach zero, which could cause a veto effect. For instance, for the Nearest Neighbour
classifier with $k = 1$ the smallest expert output will be $1/3$. As $k$ increases the smallest
estimate will approach zero as $1/(k + 2)$ and will assume zero only when $k = \infty$.
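
The closed form for the moderated k-NN output can be checked numerically. The Python sketch below uses hypothetical function names: it evaluates the ratio of integrals in Eq. (16) with a simple midpoint rule and compares it with $(\kappa + 1)/(k + 2)$. It is a sanity check of the formula under the uniform prior stated above, not part of the original experiments.

```python
from fractions import Fraction

def moderated_knn(kappa, k):
    """Moderated k-NN estimate (kappa + 1) / (k + 2) as an exact fraction."""
    return Fraction(kappa + 1, k + 2)

def moderated_knn_numeric(kappa, k, steps=20000):
    """Midpoint-rule evaluation of the ratio of integrals in Eq. (16)."""
    h = 1.0 / steps
    num = 0.0
    den = 0.0
    for s in range(steps):
        p = (s + 0.5) * h
        likelihood = p ** kappa * (1.0 - p) ** (k - kappa)
        num += p * likelihood
        den += likelihood
    return num / den

smallest_1nn = moderated_knn(0, 1)            # no neighbour from the class, k = 1
closed_form = float(moderated_knn(3, 5))
numeric = moderated_knn_numeric(3, 5)
print(smallest_1nn, closed_form, numeric)
```

The exact value for $\kappa = 0$, $k = 1$ is 1/3, matching the smallest output quoted for the Nearest Neighbour classifier, and the numerical integral agrees with the closed form for $\kappa = 3$, $k = 5$.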

In multiple expert fusion involving $k$-NN classifiers, moderation can play a very
important role. This has been demonstrated in [?] where the performance of a fused
system involving a product fusion rule improved dramatically.

5 Fusion System Architecture 

The discussions in Sections 2, 3 and 4 lead to the architecture shown in Figure 1. The
set of measurements from each physical or logical sensor is the input to a battery of
classifiers deploying different models for the computation of raw expert outputs. The
meaning of different models is understood in a broad sense, including not only
different distributional models, but also variations on each of the models considered. These
variations can be realised by bagging, by using different initialisations of the respective
learning algorithms, and by choosing different classifier design parameters and architectures.
Each classifier is assumed to generate the aposteriori probabilities for each of the $m$
classes. These raw outputs are first moderated before they are combined to produce a better
estimate of the class posteriors. Note that the importance of moderation will depend on
the severity of the sampling problem and the degree of averaging. In principle, the more
extensive the averaging, the less important the moderation. The combined moderated
expert outputs for each of the sensors are then fused to reach a final decision. The fusion
can be accomplished by the product rule or sum rule as discussed in Section 2.

Note that the architecture in Figure 1 is very general. If there is only one sensor, then
the final result will be obtained just by moderation and averaging. The averaging over
different models can take into account the estimation errors by associating weights with
each of the moderated outputs.
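
The three stages of the architecture (moderation, averaging over models, fusion over sensors) can be chained in a short Python sketch. Everything specific here is an assumption made for illustration: the experts are two-class k-NN classifiers described by pairs (κ, k), the function names are hypothetical, the data are made up, and the final stage is a plain normalised product that presumes no shared features between sensors.

```python
from math import prod

def moderate(kappa, k):
    """Two-class moderated k-NN output for (class 0, class 1)."""
    return [(kappa + 1) / (k + 2), (k - kappa + 1) / (k + 2)]

def average(outputs, weights=None):
    """(Weighted) average of the moderated outputs of several models."""
    weights = weights or [1.0] * len(outputs)
    total = sum(weights)
    return [sum(w * o[c] for w, o in zip(weights, outputs)) / total for c in range(2)]

def fuse_product(per_sensor):
    """Normalised product over sensors (assumes no shared features)."""
    scores = [prod(p[c] for p in per_sensor) for c in range(2)]
    norm = sum(scores)
    return [s / norm for s in scores]

# Each sensor has a battery of k-NN models; each entry is (kappa, k), where
# kappa counts the nearest neighbours that belong to class 0.
sensors = [
    [(3, 3), (2, 3)],   # sensor 1: two models
    [(0, 1)],           # sensor 2: a single 1-NN model that would veto class 0 if raw
    [(1, 3), (2, 3)],   # sensor 3: two models
]
per_sensor = [average([moderate(kappa, k) for kappa, k in models]) for models in sensors]
posterior = fuse_product(per_sensor)
print(per_sensor, posterior)
```

Without moderation the 1-NN model of the second sensor would output zero for class 0 and veto it in the product; with moderation class 0 still obtains the larger fused posterior, 7/13.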

6 Conclusions 

We argued that, in spite of the recent advances in machine learning based on the concept 
of Support Vectors, the conventional approaches to classifier design, including feature 
selection, contextual classification and classifier fusion, retain their relevance. We considered
the problem and issues of classifier fusion in more detail and discussed how they
should be reflected in the fusion system architecture. We adopted the Bayesian viewpoint
and showed how this led to classifier output moderation to compensate for sampling problems.
We then discussed how the moderated outputs should be combined to reflect the
prior distribution of models underlying the classifier designs. We then elaborated how
the final stage of fusion should combine the complementary measurement information
that might be available to different experts. This process is embodied in an overall architecture
which shows why the fusion of raw expert outputs is a nonlinear function of
expert outputs and how this function can be realised as a sequence of relatively simple 
processes.

Abstract. An application of a combined statistical and structural-syntactic
approach to Chinese character recognition is presented. The algorithm adopts a
structural representation for Chinese characters, but in the classification and
training process, the structural matching and parameter adjustment are conducted
in a statistical way. Unlike conventional structural approaches,
this system requires only a little predefined “knowledge”. In most cases,
knowledge acquisition is simplified to “memorization” of examples, and the
parameters for classification can be refined using statistical training. In this way
it avoids the main difficulties inherent in the implementation of classification
systems based on structural features. Compared with conventional statistical
algorithms, the algorithm is based on a structural model of image patterns, so it
retains nearly all the advantages of structural pattern recognition
algorithms. A prototype system has been realized based on this strategy, and the
effectiveness of the method is verified by the experimental results.



1 Introduction 

As in the general area of pattern recognition, there are two main trends in Chinese
character recognition: the statistical approach and the structural-syntactic
approach. Although the combined statistical and structural approach has been
widely used in online handwritten character recognition and offline Latin handwriting
recognition, similar ideas are very difficult to apply to offline Chinese character
recognition because of the complexity of the problem. In this paper,
a combined statistical and structural approach is proposed for offline Chinese
character recognition.

The statistical approach has many advantages, such as a simple and objective
algorithm structure, a firm basis in statistical decision theory, and ease of
training, so it has been used widely in practical systems.

Unfortunately, the statistical approach does have some disadvantages. For
example, it is vulnerable to:

• Complex backgrounds, such as in the case of video information retrieval.
Statistical algorithms usually require pre-segmentation of the object of interest from
the background;

• Deformable pattern recognition tasks. For highly deformable patterns, it is
difficult to train a statistical classifier to recognize all the deformed versions of image
patterns with high fidelity.

On the other hand, the structural-syntactic approach has advantages in the
above aspects, but its disadvantages are usually overwhelming. Some of them are:

• Designing a successful structural-syntactic “mechanism” is difficult: only human
beings have this mechanism, and even human beings do not know how to realize it
in a computer;

• Existing implementations are very complex and hard to learn; as above,
only human beings know how to learn.

If we can combine these two approaches, the advantages of both can be
utilized to enhance the accuracy and robustness of classification systems, while
reducing the difficulties of implementation.

In this paper, a novel algorithm for recognition of Chinese characters is proposed 
using combined statistical and structural-syntactic approaches, and a prototype 
system, which can classify over 7,000 Chinese characters, is realized based on this
algorithm. The new algorithm has the following features: 

1. A simple, objective algorithm structure, which is typical of the statistical-based 
approach; 

2. Learning is accomplished by statistical training; 

3. Structural representation of the image patterns; 

4. Except for a little pre-defined knowledge, knowledge acquisition is simplified to 
“memorization” of examples. 

We will describe the whole strategy and algorithm in section 2 to section 4; section 
5 presents some experimental results using our prototype system; and finally, in 
section 6, some conclusions are given. 



2 Knowledge Representation 

One essential feature of the algorithm proposed here is that it is a hybrid of 
structural representation of image patterns and a statistical method for classification 
and training. The structural representation of image patterns is important since it
gives the classification system many of the advantages described above.



2.1 Internal Structural Representation of Chinese Characters 

A well-known structural representation of Chinese characters decomposes
characters into radicals, and radicals into strokes, as shown in Fig. 1.

The reason for adopting this hierarchical representation of each character is that it
is concise and is similar to the conceptual representation of Chinese characters by
human beings.

Chr: Character; Rad: Radical; St: Stroke
Fig. 1. The structural representation of Chinese characters.



2.2 Knowledge Acquisition by a “Memorization” Process 

In the conventional structural-syntactic approaches, knowledge acquisition is a 
very complex and difficult task. If we can replace knowledge by the memory of some 
quantities, then the implementation of the classification system will become much 
easier. 

In this paper, we use the online method to input Chinese characters, which we call
“examples”, and several hundred radicals to the classification
system. After the input process, each “example” is recognized and decomposed into
radicals, and each radical is decomposed into strokes automatically, by using the
online character recognition algorithm developed by our lab. The precision of this
automatic classification and decomposition process can be very high if some
constraints on character input are used. Then, the position and size of each
decomposed radical and stroke are stored in a database. We call this process the
“memorization” process. In our system, only one example of each character is
necessary.
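
One way to picture the “memorization” database is as a nested record per character, holding the position and size of every radical and stroke. The sketch below is only an illustration: the class names, the bounding-box encoding, and the stroke labels are assumptions and are not taken from the prototype system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Box:
    """Position (x, y) and size (width, height) of a component (assumed encoding)."""
    x: float
    y: float
    width: float
    height: float

@dataclass
class Stroke:
    kind: str          # label of a standard stroke (hypothetical naming)
    box: Box

@dataclass
class Radical:
    name: str
    box: Box
    strokes: List[Stroke] = field(default_factory=list)

@dataclass
class CharacterExample:
    label: str
    radicals: List[Radical] = field(default_factory=list)

    def strokes(self) -> List[Stroke]:
        """All strokes of the character, radical by radical."""
        return [s for r in self.radicals for s in r.strokes]

def memorize(database: Dict[str, CharacterExample], example: CharacterExample) -> None:
    """Store one example per character label; a later example replaces an earlier one."""
    database[example.label] = example

database: Dict[str, CharacterExample] = {}
left = Radical("left", Box(0, 0, 40, 100), [Stroke("vertical", Box(20, 0, 2, 100))])
right = Radical("right", Box(45, 0, 55, 100),
                [Stroke("horizontal", Box(45, 10, 55, 2)),
                 Stroke("vertical", Box(70, 0, 2, 100))])
memorize(database, CharacterExample("example-1", [left, right]))
```

Keeping a single record per label mirrors the statement that only one example of each character is necessary.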

Of course, these memorized examples alone are not enough for classifying handwritten
Chinese characters. A little “knowledge” is still required. We call this “irreducible”
knowledge the “primitive” knowledge.



3 The “Primitive” Knowledge 

As pointed out in the last section, purely memorized quantities may not be enough
for complex perception tasks. Among the major factors of human perception,
there is surely something that should be called “skills”. We use the “primitive”
knowledge to represent these skills.

In this paper, the “primitive” knowledge includes a definition of standard strokes,
which are at the lowest level of the hierarchy in the structural representation of Chinese
characters, and the ability of the system to locate and recognize these standard strokes
against complex backgrounds. Though this is a difficult task in itself, compared with
the direct recognition of thousands of Chinese characters, each with a different, complex
shape, this irreducible knowledge requirement is much easier to meet. Our source
code for this task is less than 1,000 lines.

Fig. 2. Examples of the online-input characters.

Fig. 3. Examples of the online-input radicals. They have been
decomposed into strokes, as marked by “o”.



3.1 Stroke Descriptions 

First, we define 17 standard strokes as the basic components of Chinese 
characters and radicals. These standard strokes are listed in Table 1. 



Table 1. The standard strokes.

For each standard stroke, a context-dependent model of the stroke shape,
namely the shape generation model (SGM), is proposed. In the SGM, we represent each
stroke by a directed state transition diagram, as shown in Fig. 4.

By modeling the strokes using the directed state transition diagrams, we assume
that each stroke is composed of several line segments in the character image. We call
these line segments image cells. We will describe how to segment a character image
into image cells briefly in section 5. The image cells in a particular stroke can be
connected or disconnected. The relation “connected” is represented by a transition
between the two states that correspond to image cells. The “staying on” transitions,
which are represented by self-loops on each state, are incorporated to compensate for
uncertainties in the segmentation of basic image cells, especially for arcs with large
shape distortions, which are common in handwritten characters.

The main attributes of the states in the state transition diagram are the directional
properties of the image cells that are components of a particular stroke. The transition
rules, which guide the transitions between states, describe the constraints in a particular
shape, and also include possible distortions and disturbances, such as broken strokes.

Fig. 4. The state transition diagram representation of shape generation models. I is the initial
state, E represents the ending state, and 1, 2, ..., n are intermediate states.

The advantage of this shape generation model, as compared with the conventional
static shape matching methods for stroke extraction, is that it deals more effectively
with arcs, which are common in handwritten character shapes, and with various
distortions, including those caused by the skeleton extraction operation.

In Fig. 4, state “I” represents the initial state, which corresponds to the initial image
cells of the detected attention focus areas (AFA’s), described in
the next section. Since initial image cells need not correspond to the
starting or terminating line segments of strokes, in the shape generation process they
are extended in two directions under the guidance of transition rules. In the following
sections, we also use initial cells to refer to initial image cells or AFA’s.
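
The state transition diagram can be sketched as a small acceptor over the directions of successive image cells. This is a deliberately simplified illustration, not the model used in the prototype: it assumes each intermediate state is described by a single expected direction in degrees with a fixed tolerance, it reads the cells in order from one end of the stroke (rather than growing in two directions from an initial cell), and the class name `ShapeGenerationModel` and its parameters are hypothetical.

```python
def angle_diff(a, b):
    """Smallest difference between two undirected line directions, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

class ShapeGenerationModel:
    """Chain I -> 1 -> 2 -> ... -> n -> E with a self-loop on each intermediate state."""

    def __init__(self, directions, tolerance=30.0):
        self.directions = list(directions)   # expected direction of each state
        self.tolerance = tolerance            # allowed deviation (assumed value)

    def accepts(self, cell_directions):
        """Return True if the sequence of image-cell directions can generate the stroke."""
        active = {-1}                         # -1 stands for the initial state I
        for d in cell_directions:
            reached = set()
            for s in active:
                for t in (s, s + 1):          # stay on the state, or advance by one
                    if 0 <= t < len(self.directions) and \
                            angle_diff(d, self.directions[t]) <= self.tolerance:
                        reached.add(t)
            active = reached
            if not active:
                return False
        # The ending state E is reachable only from the last intermediate state.
        return len(self.directions) - 1 in active

# A stroke made of a horizontal part followed by a vertical part.
turn = ShapeGenerationModel([0.0, 90.0])
```

The self-loops let several slightly different cells, for example the pieces of a distorted arc, be absorbed by the same state.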



3.2 A Selective Attention Approach for Stroke Extraction 

Based on the SGM’s of strokes defined as above, we propose the following 
selective attention approach for stroke extraction from complex backgrounds. 

The algorithm is composed of two steps: first, attention focus areas (AFA’s) are
detected; second, based on the SGM’s of strokes, a dynamic growing procedure is applied
to group image cells into candidate 2D stroke shapes.

Various features can be used to detect AFA’s. In experiments, we have simply 
used positional and directional relations between image cells, as well as the 
directional features of individual image cells to detect AFA’s. Other features can also 
be used. This flexibility of applying multiple features to detect multiple AFA’s 
guarantees the reliability and robustness of the shape extraction result, and is one 
advantage of the selective attention approach proposed here. 

It is well known that the accuracy of stroke extraction is limited by noise and
uncertainty, no matter how much knowledge is used. In fact, we do not expect our
algorithm to extract correct strokes all the time; we only require it to extract possible
strokes, based on simple shape criteria. It is essential for the algorithm to extract all
the possible strokes, as shown in the experiments, so we have relaxed some criteria
to take into account some confusing conditions.



4 Statistical Approach for Structural Matching

Apart from applying the algorithm of section 3 to extract candidate strokes, the rest
of the character recognition process should be conducted in a straightforward and
objective way, based on the memorized examples.

An optimal structure-matching algorithm for Chinese characters based on various 
statistical quantities describing the structural features is proposed in the following. In 
the structural matching procedure, the stroke components of the memorized examples 
are compared with their possible counterparts extracted from the character images. For 
this purpose, we need a statistical model to make comparisons and decisions. 



4.1 The Statistical Model 

In our algorithm, various statistical measurements of structural features are
obtained and described using probability models. An example is given in Fig. 5. We
measure the size of individual strokes using the quantities H and W, and the relative
position of stroke 1 and stroke 2 using LH and LW, respectively. Each of these quantities is
modeled by a Gaussian mixture, and the parameters of these models, which are initially
determined empirically based on the “examples”, can be refined by training, as
described in the following sections.
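
As an illustration of how such measurements could be scored, the sketch below evaluates one-dimensional Gaussian mixtures for H, W, LH and LW and sums their log densities. Treating the quantities as independent, and all parameter values shown, are assumptions made for the example; the paper does not specify how the individual densities are combined.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def mixture_pdf(x, components):
    """components: list of (weight, mean, variance) with weights summing to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

def log_likelihood(measurements, models):
    """Sum of log mixture densities over the measured quantities (independence assumed)."""
    return sum(math.log(mixture_pdf(measurements[name], models[name])) for name in models)

# Hypothetical models initialised around the values of a memorized example.
models = {
    "H":  [(1.0, 30.0, 16.0)],
    "W":  [(1.0, 4.0, 4.0)],
    "LH": [(0.5, 10.0, 9.0), (0.5, 14.0, 9.0)],
    "LW": [(1.0, 25.0, 25.0)],
}
close = {"H": 31.0, "W": 5.0, "LH": 12.0, "LW": 24.0}
far = {"H": 12.0, "W": 20.0, "LH": 40.0, "LW": 2.0}
```

A candidate whose measurements lie near the memorized values (`close`) receives a higher log-likelihood than a distant one (`far`).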




Fig. 5. An example of statistical measurements of sizes and relative positions of strokes. 



4.2 Structural Matching 

The matching between extracted strokes and those of the prototype examples is 
typically a many-to-one matching problem, so some effective search methods have to 
be applied to determine the optimal structural matching between strokes. 

Compromising between accuracy and speed, we apply the beam-search method to
select the optimal combinations. The procedure is:

1. Extract the candidate strokes from the input image.

2. Compute the statistical quantities of the extracted strokes.

3. Compare these quantities with the corresponding data of the prototype models, i.e.,
the measurements obtained from the “examples”.

In this process, the strokes are added one by one, and at each step, only the top N
combinations of strokes with the highest matching scores are retained for the next
step. N is selected empirically. In this way we are able to control the computational
complexity, while still maintaining accuracy in most cases.

The procedure is conducted from one radical to another. We regard the extracted
radicals as “macro-strokes”, i.e., we compute the statistical quantities of these radicals
in the same way as we do for the ordinary strokes. Based on this “extended” stroke model,
the structure matching process proceeds according to the procedure described above.
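
The beam-search step can be sketched as follows. The sketch assumes that the candidates for each prototype stroke are given as lists, that an extracted candidate is used at most once within a combination, and that a caller-supplied `score_fn` returns the matching score of a partial combination; these choices and all names are illustrative and are not taken from the prototype.

```python
def beam_search(candidates_per_stroke, score_fn, beam_width):
    """Add strokes one by one, keeping only the beam_width best partial combinations.

    candidates_per_stroke: one list of candidate ids per prototype stroke.
    score_fn: maps a tuple of chosen candidates to a matching score (higher is better).
    Returns (best_combination, best_score), or None if no combination is possible.
    """
    beam = [((), 0.0)]
    for candidates in candidates_per_stroke:
        expanded = []
        for combination, _ in beam:
            for candidate in candidates:
                if candidate in combination:      # assumed: each candidate used once
                    continue
                extended = combination + (candidate,)
                expanded.append((extended, score_fn(extended)))
        if not expanded:
            return None
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_width]              # keep the top N combinations
    return beam[0]

# Toy example: candidate ids are numbers, and the score penalises the distance
# of each chosen candidate from a hypothetical target per prototype stroke.
targets = (2, 4)

def toy_score(combination):
    return -sum(abs(c - t) for c, t in zip(combination, targets))

best = beam_search([[1, 2, 3], [2, 3, 4]], toy_score, beam_width=2)
```

A small beam width keeps the number of evaluated combinations roughly linear in the number of strokes, at the risk of discarding a combination that would have scored best at the end.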



4.3 Learning: Training 

One advantage of adopting the statistical approach for structural matching is that
the learning process can be easily realized using statistical training, a conventional
procedure in statistical pattern classification.

In our system, the data of the pre-memorized “examples” are used as initial 
parameters, and statistical measurements of the size/position of structural features 
from the input images are accumulated. These data are used to adjust the parameters 
for structural matching, such as the parameters of Gaussian mixtures. This supervised 
training is a typical EM process. 
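
For reference, one iteration of the standard EM update for a one-dimensional Gaussian mixture is sketched below, with the accumulated measurements as data and the values derived from the memorized “examples” as the starting point. This is the textbook update under assumed initial values, not the training code of the prototype; the variance floor is an added safeguard.

```python
import numpy as np

def em_step(x, weights, means, variances, min_variance=1e-6):
    """One EM iteration for a 1-D Gaussian mixture; returns updated parameters."""
    x = np.asarray(x, dtype=float)[:, None]            # shape (n_samples, 1)
    w = np.asarray(weights, dtype=float)
    m = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)

    # E-step: responsibility of each component for each sample.
    dens = w * np.exp(-((x - m) ** 2) / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means and variances.
    nk = resp.sum(axis=0)
    new_w = nk / x.shape[0]
    new_m = (resp * x).sum(axis=0) / nk
    new_v = np.maximum((resp * (x - new_m) ** 2).sum(axis=0) / nk, min_variance)
    return new_w, new_m, new_v

# Measurements of one quantity (e.g. a stroke height) accumulated from input images.
samples = [-0.1, 0.0, 0.1, 9.9, 10.0, 10.1]
w, m, v = [0.5, 0.5], [1.0, 8.0], [4.0, 4.0]             # initial values (assumed)
for _ in range(20):
    w, m, v = em_step(samples, w, m, v)
```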



5 Simulations 

We are interested in testing the simple structure of the proposed strategy in some
typical cases that are difficult to treat by either the statistical approach or the
structural approach alone. In this paper, we apply the algorithm to the recognition of a
class of deformable image patterns, the handwritten Chinese characters. The 
procedure is given in Fig. 6. 




Fig. 6. The procedure of handwritten Chinese character recognition.

The different parts in the procedure are described below.



5.1 Preprocessing 

The preprocessing procedure includes noise filtering using morphological filters, 
skeleton extraction, and most importantly, segmentation of the image skeleton into 
image cells, which are referred to in section 3. The image cells are defined as the line
segments with uniform directional attributes. To extract the image cells, we detect the
intersections on the skeleton, and compute the curvature at each point on the skeleton.
The segmentation results of our algorithm are satisfactory, being good
approximations to the results of human judgement.



5.2 Recognition of Deformable Characters 

There are 7920 “examples” of handwritten Chinese characters in our database. For
each input character image, we compare, or match, the “structure” of this character
image with each of the examples in the database, and the results of this structural
matching are represented by the matching scores (∈ [0, 1]). The category of the input
character is determined by comparing these “scores”.

When comparing the “structure” of the input character image with an example in the
database, the structural matching proceeds from one radical to another, using the
algorithm described in section 4.2. An example of this procedure is given in Fig. 7 to
Fig. 9. The candidate strokes in a particular radical are extracted from the character
image using the algorithm described in section 3.2. The score of the “best match” is
the final matching score of the character example with the input image.

Fig. 7 to Fig. 9 give the intermediate structural matching results for character H 
with an input character H. The characters that give the highest matching scores for 
the input character H are given in Table 2. 

Table 2. Characters with the highest matching scores for an input image of character

Characters   ppl      P0       PI       PS       ynf
Scores       0.7774   0.4639   0.3790   0.3311   0.3259



6 Conclusions 

An algorithm for image pattern classification is proposed. The main feature of this
classification algorithm is that it tries to combine the advantages of both the statistical
and structural-syntactic approaches, while minimizing the difficult parts of these
two conventional approaches to pattern classification.

Compared with the conventional statistical approaches, the strategy proposed here
makes use of the structural features of images, so it is expected to give better results and
flexibility in some real applications, especially the extraction of deformable image patterns
from complex backgrounds. In some ways, the combined statistical and structural
approach to image pattern classification can be viewed as a natural extension of the
methodology for 1-dimensional signal classification, such as speech recognition. In
our method, we have chosen the strokes as the primitive structural elements, or basic
cells, since strokes are relatively simple, and correspond to the basic structural
components in Chinese characters, just as phonemes do in speech models. The
position/size relation model for a particular character in our system corresponds to the
model for words, such as an HMM, in speech recognition, and both of these models can
be trained in a statistical way.

Future work in this direction includes:

1. More accurate modeling of strokes, so the strokes can be extracted more
accurately; and more suitable selection of statistical models for structural matching.

2. Accelerating the present algorithm by incorporating more constraints in the
procedure of structural matching.

3. Conducting experiments on the extraction of characters from complex backgrounds,
such as in the application of video/image information retrieval, using this algorithm;
and applying this strategy to the classification of other deformable image patterns in
complex backgrounds.

Fig. 7. (a) The skeleton of the input character image. (b)-(f) Some extracted
candidates for stroke^ • — : the skeleton of the extracted strokes. — : the skeleton 
of the input character. (g)-(k) Some extracted candidates for stroke \1/.

(e) Score=0.8192 (f) Score=0.8086 (g) Score=0.7090



Fig. 8. (a)-(d) Results of structural matching for radical i5 and the corresponding
scores, (e)-(g) Results of structural matching for radical W and the corresponding 
scores. — : the skeleton of the extracted radicals. — : the skeleton of the input 
character. 




(a) Score=0.7774 (b) Score=0.7303 (c) Score=0.7075 (d) Score=0.4487 

Fig. 9. (a)-(d) Results of structural matching for the character with the highest
matching scores.—: the skeleton of the extracted character. — : the skeleton of 
the input character. The final matching score is 0.7774.



A Hybrid System for the Recognition of Hand-Written Characters

Abstract. In this paper we introduce a new hybrid system for the automated
recognition of hand-written characters - we combine the most
promising approaches of the last decade, i.e., neural networks and
structural/syntactical analysis methods. The given patterns represent hand-written
capital letters and digits stored in arrays. The first part of the
hybrid system consists of the implementation of a neural network and
yields a rapid and reliable pre-selection of the most probable characters
the given pattern may represent. Depending on the quality and the
special characteristics of the given pattern a flexible set of characters is
communicated to the second part of the hybrid system, the structural
analysis module. The final decision is based on the evaluation of the
presence of features, being characteristic for a specific character, in the
underlying pattern. Basically, the structural analysis module consists of
graph controlled array grammar systems using prescribed teams of productions.
We describe the main parts of the implemented hybrid system
and demonstrate the power of our approach.

Key words: array grammars, character recognition, hybrid systems, 
neural networks, structural analysis 



1 Introduction 

The recognition of handwritten characters is still an important issue in automatically
processing forms and vouchers. Although many systems are already used
in various fields and sometimes are considered as hardly improvable, recognition
rates in existing applications show that additional human review and postprocessing
still play an important and expensive role. Especially for recognition
tasks where no context information can be taken into account for verifying the
result of the automated recognition system, improving the reliability of the recognition
procedure still constitutes a significant step forward. Thus our system
is especially designed to be used in banks for automated processing of vouchers
and recognition of hand-written data in fill-in forms.

As recent years were dominated by research in neural networks (e.g., cf. 0),
which produced great new ideas, we wanted to go one step further and integrate them
in a hybrid system together with a model using structural analysis (e.g., see
0). As started in previous works (e.g., see [7]), our main intention was to find
a way of enforcing the advantages of both approaches and compensating their
shortcomings. While in [7] syntactical and statistical features were used as input
for a statistical classifier, we implemented a backpropagation neural network (cf.
0) for a fast pre-selection to speed up the final classifier component, which was
implemented based on the theoretical model of graph controlled array grammars
with prescribed teams of productions; these array grammars are designed in such
a way that they come up with an exact analysis of the detected features. The
main idea behind our model is to emphasize the reliability of the final classifier
and to overcome the lack of speed of the syntactical analysis by the pre-selection
carried out by the neural network.

The paper is organized as follows. In the next section we describe the main 
parts of the hybrid system and explain our motivation for its design. In Section 3 
we present the statistical pre-selection module based on neural networks and 
describe the flexible interface between this statistical pre-selection module and 
the final syntactical classifier (the structural analysis module). Section 4 deals 
with the structural analysis module and presents its main ideas and advantages. 
Finally, in Section 5 we give a short overview on the results obtained with the 
prototype of our hybrid system as well as an outlook to future work. 



2 The Hybrid System 

The complete system was especially designed for the recognition of hand-written 
capital letters and digits as they are used in fill-in-forms and vouchers, where 
the single letters are found on prescribed fixed positions and no overlapping has 
to be taken into account. 



2.1 Data Acquisition and Preprocessing 

The database of hand-written characters we used was acquired by Arnold (0)
from hundreds of persons on specific forms. The scanned characters first were
normalized to fill out a 320 x 400 grid in order to get comparable patterns. After
the elimination of noisy pixels, the resulting arrays on the 320 x 400 grid were
mapped on a 20 x 25 grid. These arrays on the 20 x 25 grid then were subjected to
a thinning algorithm (e.g., see mi, EH) which finally yielded unitary skeletons
of the digitized characters.
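
The mapping from the 320 x 400 grid to the 20 x 25 grid can be pictured as a block-wise reduction, since 320/20 = 400/25 = 16. The sketch below is an assumption about how such a mapping might be done: the paper does not state the rule, so the criterion used here (a coarse pixel is set when at least a given fraction of its 16 x 16 block is set) and the threshold value are purely illustrative.

```python
import numpy as np

def downsample(pattern, block=16, min_fraction=0.25):
    """Map a binary array of shape (400, 320) to shape (25, 20) block by block.

    A coarse pixel is set when at least min_fraction of the pixels in its
    block x block cell are set (assumed rule, not taken from the paper).
    """
    pattern = np.asarray(pattern, dtype=bool)
    rows, cols = pattern.shape
    cells = pattern.reshape(rows // block, block, cols // block, block)
    return cells.mean(axis=(1, 3)) >= min_fraction

# 400 rows by 320 columns, i.e. a grid that is 320 wide and 400 high.
fine = np.zeros((400, 320), dtype=bool)
fine[0:16, 0:16] = True        # a completely filled cell
fine[16:32, 16:20] = True      # a quarter-filled cell
fine[32, 32] = True            # an isolated pixel
coarse = downsample(fine)
```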

2.2 Motivation for Designing a Hybrid System 

Regarding the set of characters we consider and the sub-patterns they consist 
of, obviously there are different degrees of similarity among characters. The 
quality of classifiers also varies depending on this degree of similarity. Hence,
our intention was to combine two different methods, the first one specialized in
excluding all characters with low similarity to the underlying pattern, and the
second one concentrating on working out very detailed information from the
pattern and its closeness to the remaining candidate characters.



The neural network module. The task of the statistical module is to provide 
a group of candidate characters with high similarity to the given pattern. As this 
goal should be achieved very quickly, a neural network seemed to be appropriate 
for this task. We require that it must not exclude the correct solution, therefore 
its primary task is not to provide the correct result but to eliminate all characters 
that significantly differ from the given pattern. 



The structural analysis module. The output of the neural net - a reduced
number of candidate characters - is transferred to the final classifier. The method
used there has to provide the possibility to check for very small details,
because the differences between similar characters tend to consist of only a few
pixels. Structural analysis based on array grammars can be a powerful approach
to this task (e.g., see P3|, H); hence, we developed a special array grammar for
each character that analyses the pattern with respect to the inherent features of
the character and its differences from other ones. The main problem of syntactical
analysis usually is the inherent non-determinism leading to unpredictable processing
times in the worst-case behaviour. Thus we enriched the array grammars
with several control mechanisms, which allowed us to eliminate the undesirable
non-determinism.



3 The Neural Network Module 

The reason for needing a fast pre-selection is the way structural analysis
works: As the final classifier analyses the pattern with respect to each possible
character, it also tries to find characters in patterns that do not show the characteristic
features of these characters, which results in a big overhead in processing
time. Neural networks provide the solution for this problem, because we get a
fast elimination of characters that are obviously not represented by the given
pattern and therefore need not be analysed in every little detail. For the remaining
candidate characters we can use the more reliable and exact - but slower -
structural recognition module to make a final decision.

The most important aspect is the reliability of the recognition achieved by
the neural net. As we design a hybrid system, there are two recognition results,
and the final result is the combination of both. Although the motivation of
the network was a speed-up of the system, we had to put emphasis on the
reliability of the output, to guarantee the possibility of a correct classification
by the structural analysis. This means that it is not so important how many
characters are excluded, but that the searched one is within the result of the
pre-selection. Many studies have been made of the recognition rate that
can be reached by using neural network classifiers, but the very special interest
in our neural net is the reverse performance, i.e., the reliability not to exclude
the character most likely shown in the pattern.



3.1 Input Vector 

Normal backpropagation with one input unit for every pixel leads to very poor 
results, as it is very hard for the net to develop internal representations for 
structures like lines and edges. Therefore we use structured backpropagation 
where we pre-process the shown pattern to detect features and to reduce the 
dimension of the input. The task is to reduce the number of input units from 500 
(each pixel representing one input unit) down to a reasonable number without 
losing essential information about the pattern.

Further we have to try to structure the input vector in such a way that 
information about the positions of pixels and their neighbourhood relations to 
other pixels are extracted, which is a kind of information neural nets are hardly 
able to extract from pixel-dependent inputs. Therefore we choose the following 
inputs: One part of the input vector consists of the pixel sums per column and 
row. This strategy brings good values for characters with straight lines, as H or 
F, but it is weak in extracting the features of characters consisting of curves as 
C or O. Therefore the second part of the input vector consists of the number of 
contiguous pixel groups per line for each column and row, which is an indicator 
for the number of lines cut in the corresponding row or column, respectively. 
Applying these ideas we reduce the dimension of the input vector from 500 to 
90 and achieve a spatial structuring of the input data. 
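
The construction of the 90-dimensional input vector can be sketched as follows for a 20 x 25 pattern stored as an array with 25 rows and 20 columns: 20 column sums and 25 row sums give 45 values, and the numbers of contiguous pixel groups per column and per row give another 45. The ordering of the four parts inside the vector and the function names are assumptions made for the illustration.

```python
import numpy as np

def count_runs(line):
    """Number of contiguous groups of set pixels in a one-dimensional line."""
    line = np.asarray(line, dtype=bool)
    if line.size == 0:
        return 0
    starts = line[1:] & ~line[:-1]          # positions where a new group begins
    return int(line[0]) + int(starts.sum())

def input_vector(pattern):
    """Build the structured input vector from a binary pattern (rows x columns)."""
    pattern = np.asarray(pattern, dtype=bool)
    col_sums = pattern.sum(axis=0)           # pixel sum per column
    row_sums = pattern.sum(axis=1)           # pixel sum per row
    col_runs = [count_runs(pattern[:, c]) for c in range(pattern.shape[1])]
    row_runs = [count_runs(pattern[r, :]) for r in range(pattern.shape[0])]
    return np.concatenate([col_sums, row_sums, col_runs, row_runs]).astype(float)

# A schematic letter H on the 20 x 25 grid: two vertical bars and one crossbar.
letter_h = np.zeros((25, 20), dtype=bool)
letter_h[:, 5] = True
letter_h[:, 14] = True
letter_h[12, 5:15] = True
vector = input_vector(letter_h)
```

For the H above, a row through the two bars yields two groups while the crossbar row yields one, which is the kind of "number of lines cut" information the second part of the vector is meant to carry.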

3.2 Communicating the Neural Network Result 

The only interface between the neural network component and the array grammar
component consists of a simple linked list of records containing pre-selected
characters and their probability. There is no need for sorting, because all the 
selected characters are processed anyway. 

The probability that the shown character is a member of the selected group 
has to be very close to hundred percent. If this were not guaranteed, the exact 
analysis would make no sense, as it could never bring a correct result. The result 
of the neural network analysis is a value (= probability) of each output unit 
(= character). When using a softmax activation function, those values can be 
interpreted as percentages. Depending on those percentages we have to make 
a decision, which units (characters) are fitting best and therefore have to be 
handed over to the final analysis. For the selection of the candidate characters 
computed by the neural network we considered several different strategies: 
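
As a reminder of why softmax outputs can be read as percentages, the following short sketch computes them from raw output activations; the activation values are made up for the example.

```python
import math

def softmax(activations):
    """Convert raw output activations into values that are positive and sum to 1."""
    largest = max(activations)                       # subtract the maximum for stability
    exps = [math.exp(a - largest) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# One activation per output unit (character); the values are illustrative.
percentages = softmax([2.0, 1.0, 0.1])
```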



Selection down to a percentage limit. The simplest possibility is to select
all units whose percentage is higher than a fixed limit. The advantage of this
method is its simplicity and clearness. The average number of chosen letters can
easily be altered by shifting the limit, and results are easy to predict. In addition,
the values need not be sorted before we make our decision, as it only depends
on the predefined limit. If no unit reaches this limit, we simply choose all the
characters as possible candidates. But there are some serious disadvantages, as
the distribution of the values is ignored, e.g., if there is a flat distribution with
percentages around the limit, small differences can have an unacceptably great
influence on the result.



Selection by summing up to a percentage sum. This method avoids the 
problems of the one above in a simple way. We first sort the output-values in 
descending order, and then sum up the percentages assigned to the selected 
characters, until we reach a certain percentage sum limit. The average number 
of chosen characters can easily be altered by shifting the limit again, but a flat 
distribution results in a larger set of possible candidates. 
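
The first two selection strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the network output is taken to be a dictionary mapping characters to softmax values, and the function names and limits are hypothetical rather than those of the original prototype.

```python
from typing import Dict, List


def select_by_threshold(probs: Dict[str, float], limit: float) -> List[str]:
    """Keep every character whose value exceeds the limit; no sorting needed."""
    chosen = [char for char, p in probs.items() if p > limit]
    # If no unit reaches the limit, all characters remain candidates.
    return chosen if chosen else list(probs)


def select_by_cumulative_sum(probs: Dict[str, float], sum_limit: float) -> List[str]:
    """Take characters in descending order until their values sum to the limit."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    chosen: List[str] = []
    total = 0.0
    for char, p in ranked:
        chosen.append(char)
        total += p
        if total >= sum_limit:
            break
    return chosen
```

With a flat distribution the cumulative variant automatically returns a larger candidate set, which is the behaviour described above.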



Selection of a predefined group of characters. Recognition by neural net- 
works tends to determine groups of characters having significant similarities 
between them. To be sure that all the characters that are rather similar for the 
neural network and therefore hard to distinguish, are communicated to further 
structural analysis, we implemented a kind of group selection. One method for 
determining suitable groups is obtained by doing a statistical evaluation of the 
results of the structural analysis. Another method is to create a test set for the 
trained neural network consisting of very clearly and correctly written characters 
and to evaluate which characters look similar for the trained network. Combining 
the results of these two methods resulted in a system with good 
reliability. These results can also be used to create a new network structure, 
where we do not use one output unit for exactly one character, but one for each 
group of similar letters. Thus one character can be a member of more than one 
group, which brings more flexibility into the groups (e.g., F is similar to E; in a 
different way F is also similar to P, but E and P are easy to distinguish). 
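
One plausible reading of the group selection is sketched below in Python. The groups shown are hypothetical and only mirror the F/E/P example from the text; the rule of expanding the best-scoring character to all groups containing it is an assumption, since the exact procedure is not spelled out here.

```python
from typing import Dict, List, Set

# Hypothetical similarity groups; F belongs to two groups, as in the example above.
GROUPS: List[Set[str]] = [{"E", "F"}, {"F", "P"}, {"C", "O"}]


def select_by_groups(probs: Dict[str, float], groups: List[Set[str]]) -> Set[str]:
    """Return the best-scoring character together with all members of its groups."""
    best = max(probs, key=probs.get)
    candidates = {best}
    for group in groups:
        if best in group:
            candidates |= group
    return candidates
```

Because groups may overlap, a pattern scored as F passes E and P on to the structural analysis, whereas a pattern scored as E passes only E and F.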



4 The Structural Analysis Module 



If emphasis is put on reliability, the most promising method for recognizing 
characters is to detect their structural features in a very exact way. This becomes 
the more important the sloppier the characters are written. When a given pattern 
is very similar to several characters, distinguishing them can depend on tiny 
lines or small differences in distances between lines and endpoints. Hence, the 
structural analysis module is designed to decide between quite similar characters 
by using very detailed information about the pattern and its analysed lines. 
As shown in 0, array grammars are a very promising method for syntactical 
character recognition; we further developed the ideas discussed there and could 
reduce parsing complexity especially by eliminating non-determinism (see jH]). 

4.1 Arrays and Array Grammars 

Following ini, isj, p, PH, and ini, we introduce the following definitions and 
notations for (two-dimensional) arrays and array grammars. 

An array over the alphabet $V$ is a function from $\mathbb{Z}^2$ to $V \cup \{\#\}$; $\#$ is 
the blank symbol. The set of all arrays over $V$ is denoted by $V^{*2}$. An array 
grammar is a structure $G = (V_N, V_T, \#, P, \{(v_0, S)\})$, where $V_N$ is the alphabet 
of non-terminal symbols, $V_T$ is the alphabet of terminal symbols, $V_N \cap V_T = \emptyset$, 
$\# \notin V_N \cup V_T$; $P$ is a finite non-empty set of array productions over $V_N \cup V_T$, 
$\{(v_0, S)\}$ is the start array (axiom), $v_0$ is the start vector, and $S$ is the 
start symbol. We say that the array $B_2 \in V^{*2}$ is directly derivable from the array 
$B_1 \in V^{*2}$ by the array production $p \in P$, denoted $B_1 \Rightarrow_p B_2$, if and only if the 
application of $p$ to $B_1$ yields $B_2$. 

Control mechanisms. As shown in |S|, control mechanisms (cf. Pj) as control 
graphs are a suitable way to enhance the power of a grammar and can be ap- 
plied in the field of character recognition as well (see p]). Nevertheless our work 
showed that by using graph controlled array grammars in combination with sets 
of prescribed teams it is possible to eliminate non-determinism, especially in 
crucial situations with crossing points of lines. 

Graph controlled array grammars. A graph controlled array grammar is 
a construct $G_P = (V_N, V_T, \#, (R, L_{in}, L_{fin}), \{(v_0, S)\})$, where $V_N$ and $V_T$ are 
disjoint alphabets of non-terminal and terminal symbols, respectively; $v_0$ is the 
start vector, $S \in V_N$ is the start symbol; $R$ is a finite set of rules $r$ of the 
form $(l(r) : p(l(r)), \sigma(l(r)), \varphi(l(r)))$, with $l(r) \in Lab(G_P)$, $Lab(G_P)$ being 
a set of labels associated (in a one-to-one manner) to the rules $r$ in $R$, where 
$p(l(r))$ is a set of array productions over $V_N \cup V_T$, $\sigma(l(r)) \subseteq Lab(G_P)$ is the 
success field of the rule $r$, and $\varphi(l(r)) \subseteq Lab(G_P)$ is the failure field of the rule $r$; 
$L_{in} \subseteq Lab(G_P)$ is the set of initial labels, and $L_{fin} \subseteq Lab(G_P)$ is the set of final 
labels. For $r = (l(r) : p(l(r)), \sigma(l(r)), \varphi(l(r)))$ and $v, w \in (V_N \cup V_T)^{*2}$ we 
define $(v, l(r)) \Rightarrow_{G_P} (w, t)$ if and only if 

- either an array production in $p(l(r))$ is applicable to $v$, the result of the 
application of this array production to $v$ is $w$, and $t \in \sigma(l(r))$, 

- or no array production in $p(l(r))$ is applicable to $v$, $w = v$, and $t \in \varphi(l(r))$. 
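
The derivation relation can be illustrated with a small Python sketch. It is deliberately simplified and rests on several assumptions: sentential forms are strings rather than two-dimensional arrays, a production is a function that returns the rewritten form or None when it is not applicable, and the first applicable production is taken, whereas the definition allows any applicable one. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Form = str  # simplification: a string stands in for a two-dimensional array
Production = Callable[[Form], Optional[Form]]


@dataclass(frozen=True)
class Rule:
    productions: Tuple[Production, ...]
    success: FrozenSet[str]  # labels allowed after a successful application
    failure: FrozenSet[str]  # labels allowed if no production is applicable


def rewrite(lhs: str, rhs: str) -> Production:
    """Build a production replacing the first occurrence of lhs by rhs."""
    def apply(form: Form) -> Optional[Form]:
        return form.replace(lhs, rhs, 1) if lhs in form else None
    return apply


def step(form: Form, label: str, rules: Dict[str, Rule]) -> Tuple[Form, FrozenSet[str]]:
    """One derivation step: return the new form and the admissible next labels."""
    rule = rules[label]
    for production in rule.productions:
        result = production(form)
        if result is not None:
            return result, rule.success
    return form, rule.failure  # form unchanged, continue in the failure field


RULES = {
    "expand": Rule((rewrite("S", "aSb"),), frozenset({"finish"}), frozenset()),
    "finish": Rule((rewrite("S", ""),), frozenset({"done"}), frozenset({"done"})),
}
```

If the success and failure fields are singletons, as in this toy example, the control graph fixes the next rule uniquely, which is the kind of determinism the grammars discussed here aim for.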



Attribute vectors. In order to be able to store information about lines and 
crossing points (e.g., length, positions, etc.) we use additional attribute vectors 
assigned to each sentential form and therefore have to add a corresponding 
manipulator function $f$ (to each array production) which changes the attribute 
vector. Formally this means that instead of $p(l(r))$ we now have $(p(l(r)), f(l(r)))$; 
$(p(l(r)), f(l(r)))$ is applicable to a configuration $(A, v)$ yielding $(B, w)$, i.e., 
$(A, v) \Rightarrow_{(p(l(r)), f(l(r)))} (B, w)$, if and only if $A \Rightarrow_{p(l(r))} B$ and $f(l(r))(v) = w$. 
With the information stored in the attribute vector it is possible to transparently 
implement conditional branching according to the applicability of the manipulator 
function $f$. 

Prescribed teams. Although from a theoretical point of view the power of 
graph controlled array grammars cannot be enhanced further (cf. |Sj), we deci- 
ded to use prescribed teams (e.g., see 0) as additional control mechanism in 
order to eliminate non-determinism as well as to reduce parsing complexity while 
analysing parallel lines of equal lengths. For our purposes we could restrict our- 
selves to the case of (graph controlled) array grammars with prescribed teams 
of finite index: 

Let $M$ be an arbitrary set; then any object $\{(x, n_x) \mid x \in M\}$ with $n_x$ being 
a natural number for every $x \in M$ is called a multiset over $M$. Now let 
$G = (V_N, V_T, \#, P, \{(v_0, S)\})$ be an array grammar. Any finite multiset $s$ over $P$ is 
called a prescribed team over $P$; we shall also write $s$ in the form $(p_1, \ldots, p_m)$, 
where the $p_i$, $1 \le i \le m$, are exactly the array productions in $s$ occurring in the 
right multiplicity. The array productions $p_i$ of the team $(p_1, \ldots, p_m)$ are applied 
in parallel to the underlying sentential form. In our special case of character 
recognition we especially use teams to define lines that are analysed in parallel, 
i.e., in this case the number $m$ represents the number of lines analysed in parallel 
in one derivation step. 

For adding prescribed teams of finite index to our graph controlled array 
grammars with attribute vectors we allow $p(l(r))$ in a rule $r$ to be a set of 
prescribed teams of array productions of the form $(p_1, \ldots, p_m)$. 

Our final model of array grammars. The final model of array grammars 
we used is that of graph controlled array grammars with attribute vectors and 
prescribed teams of finite index 

$$G_P = (V_N, V_T, \#, (R, L_{in}, L_{fin}), \{(v_0, S)\}, a),$$

where $V_N$ and $V_T$ are disjoint alphabets of non-terminal and terminal symbols, 
respectively; $v_0$ is the start vector, $S \in V_N$ is the start symbol; $a$ is the initial 
attribute vector; $R$ is a finite set of rules $r$ of the form 
$(l(r) : \{(p_i(l(r)), f_i(l(r))) \mid 1 \le i \le k\}, \sigma(l(r)), \varphi(l(r)))$, 
where $l(r) \in Lab(G_P)$, $p_i(l(r))$ is a prescribed 
team of array productions over $V_N \cup V_T$ of the form $(p_1, \ldots, p_m)$, $f_i(l(r))$ is a 
manipulator function that operates on the attribute vector, $1 \le i \le k$; 
$\sigma(l(r)) \subseteq Lab(G_P)$ is the success field of the rule $r$, and $\varphi(l(r)) \subseteq Lab(G_P)$ is the 
failure field of the rule $r$; $L_{in} \subseteq Lab(G_P)$ is the set of initial labels, and 
$L_{fin} \subseteq Lab(G_P)$ is the set of final labels. 

Based on this theoretical model of array grammars defined above, we obtained 
a powerful and efficient tool for character recognition (see |H]) with rather small 
parsing complexity especially due to the avoidance of non-deterministic decisions 
at crossing points and at holes because of missing pixels in lines; moreover, the 
information stored in the attribute vector about lines and crossing points allowed 
us to make a detailed comparative analysis of the patterns with respect to a 
character and to define suitable error measures. 

4.2 Implementation 

In practice, our approach processes every characteristic line and stores 
its values. Therefore we first have to look for a reliable start point (SP) from which 
the analysis of the whole character has to begin. When we process a line, 
we go on until we reach its end, storing values such as its length and the positions 
of the starting and the end point as well as the attributes of all crossing points 
found along the way in the attribute vector. Following this strategy, we start the 
first set of prescribed teams of array productions representing the first lines to 
be analysed until no team of productions is applicable any more. Then the start 
points for the next set of prescribed teams of array productions are calculated 
using the values stored in the attribute vector, and then the array grammar 
continues until all lines have been analysed. At the end, an evaluation function 
calculates an error value for the character using the information stored in the 
attribute vector. 

As an example we depict the control diagrams of the array grammars imple- 
mented for the characters F and P: 




The advantage of our approach is that we get very exact values about the 
different parts of the characters. Hence, if within a group of several similar 
letters it is necessary, we can concentrate our analysis on very small details and 
draw sharp borders between different characters, which in this special case of 
characters F and P are the distances of the end points of the horizontal lines on 
the right hand side. The final evaluation takes into account the coordinates of 
these most interesting end points and calculates the exact distance, which has 
a great influence on the calculation of the error value and therefore on the final 
decision. 

5 Conclusion 

Our experiments with a prototype have shown that combining statistical and 
structural methods in a hybrid system allows for making use of the advantages 
of both approaches and for overcoming most of their shortcomings. The neural 
network together with the chosen selection policies turned out to provide a re- 
liable and fast exclusion of obviously different characters. The graph controlled 
array grammar systems using prescribed teams represent a powerful tool for the 
detection and analysis of structural features in hand-written uppercase letters 
and digits. The main advantages concern the lack of non-determinism thus avoi- 
ding backtracking during processing, the exactness of capturing detailed features 
of patterns, and the powerful methods for evaluation. 

5.1 Results 

With the selection policy using the group model, the neural network module 
achieved a nearly hundred percent reliable exclusion of a set of characters ob- 
viously not representing the underlying pattern. The number of characters com- 
municated to the structural analysis module varied with the quality of the pat- 
tern but never exceeded 9. Using the policy of summing up to a given percentage 
sum, the set of candidate characters could be reduced to 4 even in the worst case, 
but then the reliability was reduced to 94 percent. 

With the subsequent structural analysis we could obtain a final recognition 
rate of nearly one hundred percent; failures only occurred with patterns that were 
hardly recognizable even for human beings without any further context information, 
which is exactly the situation our system was designed for. Due to the 
elimination of characters with little similarity by the neural network module, 
we could overcome problems occurring with the structural analysis component 
in the case of patterns with large holes in characteristic lines where sometimes 
due to these holes the whole group of characters with high similarity was classi- 
fied worse than other characters with less similarity (e.g., misclassification of a 
pattern as character 3 instead of B because of holes in the vertical line) . 

5.2 Future Research 

As the hybrid system is mainly designed for banking applications, the set of 
characters has to be extended by special symbols like those for special currencies. 
Another interesting idea is to automatically retrain the neural network using 
the results of the final classification of the underlying patterns by the structural 
analysis module. 

Abstract. In this paper we compare recently developed and highly effective 
sequential feature selection algorithms with approaches based on 
evolutionary algorithms enabling parallel feature subset selection. We in- 
troduce the oscillating search method, employ permutation encoding of- 
fering some advantages over the more traditional bitmap encoding for the 
evolutionary search, and compare these algorithms to the often studied 
and well-performing sequential forward floating search. For the empiri- 
cal analysis of these algorithms we utilize three well-known benchmark 
problems, and assess the quality of feature subsets by means of the sta- 
tistical Bhattacharyya distance measure. 

1 Introduction 

The problem of selecting a “good” subset of features from the total of available 
features arises in a great variety of problem domains in science, engineering, and 
economy. As the number of possible subsets increases exponentially with the 
number of total features, a variety of deterministic and nondeterministic Feature 
Selection (FS) algorithms have been developed in order to escape the Curse of 
Dimensionality. 

In this work we would like to compare some very efficient sequential FS 
algorithms with parallel FS techniques based on Evolutionary Algorithms (EAs). 
Basically, a sequential FS algorithm adds a feature to or leaves out a feature from 
the subset to be constructed in an iterative manner. Hence, the feature subset 
generation depends on initial and intermediate subsets. A parallel FS algorithm 
constructs a complete feature subset at once. The latter can be achieved by EAs 
which are generally believed to be well-suited for nonlinear, high-dimensional 
problems of exponential complexity (Schwefel, 1995; Mitchell, 1996). 

The two basic categories of approaches to FS are the Filter and the Wrapper 
approach (John et al., 1994). The crucial point in discriminating these methods 
is the absence or presence, respectively, of a classifier to assess the quality of a 
feature subset. Either a statistical measure independent of a classifier is employed 
(filter approach), or the error rate of a classifier determines the usefulness of a 
feature subset (wrapper approach). In this work we adopt the filter approach, as 
our main goal is to compare the performance of conventional to evolutionary FS 
algorithms. The additional use of specific classifiers (and training algorithms) 
would increase the already complex FS process interactions, and make it even 
more difficult to isolate the effectiveness of the FS algorithms to be compared. 

* This work has been supported by AKTION Österreich - Tschechische Republik under 
grant AKTION 23p20: “Comparison of statistical and evolutionary approaches 
to feature selection in pattern recognition” 

For experimental comparisons we employ the Bhattacharyya (B-) distance 
measure (Fukunaga, 1990), and investigate algorithm performance on three dichotomous 
classification problems from the real world with the total number of 
features ranging from the single best feature to the (almost) full feature data 
set. 

1.1 Formulation of the Feature Selection Problem 

Following the statistical approach to pattern recognition, we assume that a pattern 
or object described by a real $D$-dimensional vector 
$x = (x_1, x_2, \ldots, x_D) \in X \subseteq \mathbb{R}^D$ 
is to be classified into one of a finite set of $C$ different classes 
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_C\}$. The patterns are supposed to occur randomly according to 
some true class conditional probability density functions (pdfs) $p^*(x|\omega)$ and the 
respective a priori probabilities $P^*(\omega)$. Since the class conditional pdfs and the 
a priori class probabilities are rarely known in practice, it is necessary to estimate 
these probability functions from the training sets of samples with known 
classification. 

If the pdfs are a priori known to be unimodal, probabilistic distance measures, 
e.g., Mahalanobis or Bhattacharyya distance, may be appropriate to evaluate the 
quality of a feature subset. As pointed out by Siedlecki and Sklansky (1988), the 
error rate is an even better choice for the measurement criterion J(·), 
computational feasibility provided. 



1.2 Bhattacharyya Distance for Feature Selection 

In the following formulation the B-distance measures the separability of normal 
distributions for two classes indexed by $i$ and $k$ (Fukunaga, 1990): 

$$B_{ik} = \frac{1}{8} (\mu_i - \mu_k)^T \left[ \frac{\Sigma_i + \Sigma_k}{2} \right]^{-1} (\mu_i - \mu_k) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_i + \Sigma_k}{2} \right|}{\sqrt{|\Sigma_i| \, |\Sigma_k|}}$$

where $\mu_i$, $\mu_k$ are the feature mean vectors and $\Sigma_i$ and $\Sigma_k$ denote the class 
covariance matrices for classes $i$ and $k$, respectively. We point out that a more 
general distance measure, such as the Chernoff distance, is in general closer to 
the error rate than the B-distance; on the other hand, such a measure is not easy to 
obtain (Kailath, 1967). 
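
The standard form of the B-distance for two normal distributions can be evaluated with a few lines of Python. The sketch below uses only the standard library and assumes symmetric positive definite covariance matrices given as nested lists; the helper based on Gaussian elimination and all function names are illustrative and do not reproduce the software used for the experiments.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def log_det_and_solve(a: Matrix, b: Sequence[float]) -> Tuple[float, List[float]]:
    """Return (log|A|, x) with A x = b, via Gaussian elimination with pivoting.

    Assumes A is symmetric positive definite, so the determinant is positive.
    """
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    log_det = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        log_det += math.log(abs(m[col][col]))
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return log_det, x


def bhattacharyya(mu_i: Sequence[float], mu_k: Sequence[float],
                  cov_i: Matrix, cov_k: Matrix) -> float:
    """B-distance between N(mu_i, cov_i) and N(mu_k, cov_k)."""
    n = len(mu_i)
    diff = [a - b for a, b in zip(mu_i, mu_k)]
    avg = [[(cov_i[r][c] + cov_k[r][c]) / 2 for c in range(n)] for r in range(n)]
    log_det_avg, solved = log_det_and_solve(avg, diff)
    log_det_i, _ = log_det_and_solve(cov_i, diff)
    log_det_k, _ = log_det_and_solve(cov_k, diff)
    quadratic = sum(d * s for d, s in zip(diff, solved))
    return quadratic / 8 + 0.5 * (log_det_avg - 0.5 * (log_det_i + log_det_k))
```

For identical covariance matrices the logarithmic term vanishes and the B-distance reduces to one eighth of the squared Mahalanobis distance between the class means.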

2 Feature Selection Algorithms 

Assuming that a suitable criterion function J(-) has been chosen to evaluate 
the effectiveness of feature subsets, FS is reduced to a search for a (sub)optimal 
feature subset based on the selected measure. Although an exhaustive search is 
a sufficient procedure to guarantee the optimality of a solution, in many realistic 
problems it is computationally prohibitive. Therefore, in practice one has to 
rely on computationally feasible procedures to avoid exhaustive search for the 
price of suboptimal results. A comprehensive list of suboptimal procedures and 
the corresponding formulas can be found in (Devijver and Kittler, 1982). An 
excellent taxonomy of currently available FS methods in pattern recognition is 
presented in (Jain and Zongker, 1997). 

The strategies used to find a useful subset of features range from the simple 
but popular sequential forward (SFS) and sequential backward selection (SBS), 
to more sophisticated but computationally more expensive algorithms. In the 
following section we will describe three of the latter which will then be compared 
on three benchmark problems. 

2.1 Sequential Forward Floating Search 

The sequential forward floating search (SFFS) (Pudil et al., 1994) can be viewed 
as SFS with backtracking. SFS is a simple greedy algorithm starting with an 
empty feature subset, then iteratively adding the feature which maximizes the 
evaluation criterion of the feature subset up to a specified target size t. Obviously, 
this procedure does not account for nonlinear interactions of features, hence 
SFFS offers the following improvements: 

After adding a feature $x_i$ to a subset of size $k$ with $J(k)$ in SFS manner 
(inclusion), SFFS tries to find a feature $x_j$ with $j \ne i$ which can be excluded 
so that the new $J'(k) > J(k)$ (conditional exclusion). This step is continued as 
long as a feature can be excluded under the above condition, which demands 
that the best values of $J(\cdot)$ are recorded for each subset size (continuation of 
conditional exclusion). If an exclusion step is not successful, inclusion continues 
until the algorithm has “floated” through all possible subset sizes or has reached 
a certain target size $t$. 
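
The floating behaviour can be sketched in Python as follows. This is a simplified illustration, not the reference implementation: features are integer indices, the criterion is any function on frozensets of indices, the search stops as soon as the target size is reached, and exclusion is only attempted for subsets with more than two features. The function name is illustrative.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Subset = FrozenSet[int]


def sffs(n_features: int, target_size: int,
         criterion: Callable[[Subset], float]) -> Subset:
    """Sequential forward floating search up to target_size (1 <= size <= n)."""
    best: Dict[int, Tuple[float, Subset]] = {}  # best value recorded per size
    current: Subset = frozenset()
    while len(current) < target_size:
        # Inclusion: add the feature that maximizes the criterion (SFS step).
        current = max((current | {f} for f in range(n_features)
                       if f not in current), key=criterion)
        value = criterion(current)
        if len(current) not in best or value > best[len(current)][0]:
            best[len(current)] = (value, current)
        # Conditional exclusion: drop features while this beats the record.
        while len(current) > 2:
            reduced = max((current - {f} for f in current), key=criterion)
            reduced_value = criterion(reduced)
            if reduced_value <= best[len(reduced)][0]:
                break
            current = reduced
            best[len(reduced)] = (reduced_value, reduced)
    return best[target_size][1]
```

Exclusion is accepted only if it strictly improves the best value recorded for the smaller subset size, so the procedure cannot cycle.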

While SFFS starts with an empty feature subset dominantly searching in 
forward direction, the sequential backward floating search (SBFS) “floats” in 
backward direction and analogously can be viewed as SBS with backtracking 
(Pudil et al., 1994). 

2.2 Oscillating Search 

A very recent development is the oscillating search (OS) algorithm (Somol and 
Pudil, 2000). Most of the known suboptimal FS methods are based on step-wise 
adding of features to an initially empty feature set, or on step-wise removing 
features from an initial set of all features. A single search direction - forward or 
backward - is usually preferred. It is apparent that all these algorithms spend 
a lot of time testing feature subsets having cardinalities far distant from the 
required cardinality d. 

Unlike other methods, OS is based on repeated modification of the current 
subset $X_d$ of $d$ features. This is achieved by alternating down- and up-swings. 
The down-swing first removes $o$ “worst” features from the current set $X_d$ to obtain a 
new set $X_{d-o}$, then adds $o$ “best” ones to $X_{d-o}$ to obtain a new current 
set $X_d$. The up-swing first adds $o$ “good” features to the current set $X_d$ to obtain a 
new set $X_{d+o}$, then removes $o$ “bad” ones from $X_{d+o}$ to obtain a new 
current set $X_d$ again. The initial subset $X_d$ is generated randomly, and the up- 
and down-swings are realized by SFS and SBS, respectively. 

Let us denote two successive opposite swings as an oscillation cycle. Then, 
the oscillating search consists of repeating oscillation cycles. The parameter $o$ is 
termed oscillation cycle depth and should initially be set to 1. If the last oscillation 
cycle did not find a better subset $X_d$ of $d$ features, the algorithm increases 
the oscillation cycle depth by setting $o = o + 1$. Whenever any swing finds a better 
subset $X_d$ of $d$ features, the depth value $o$ is restored to 1. The algorithm terminates 
when the value of $o$ exceeds the user-specified limit $\Delta$. Besides the basic 
OS algorithm described here, variants can be found in (Somol and Pudil, 2000). 
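
The oscillation cycle can be sketched in Python as shown below. This is a minimal illustration under stated assumptions: the swings use greedy SFS and SBS steps, a swing is skipped when the depth would empty the subset or exceed the number of features, and the names (including the parameter `limit` for the user-specified depth limit) are illustrative rather than taken from the published algorithm description.

```python
import random
from typing import Callable, FrozenSet

Subset = FrozenSet[int]
Criterion = Callable[[Subset], float]


def add_best(subset: Subset, n_features: int, count: int, J: Criterion) -> Subset:
    """SFS steps: add `count` features, each time the one maximizing J."""
    for _ in range(count):
        subset = max((subset | {f} for f in range(n_features)
                      if f not in subset), key=J)
    return subset


def remove_worst(subset: Subset, count: int, J: Criterion) -> Subset:
    """SBS steps: remove `count` features, each time the least useful one."""
    for _ in range(count):
        subset = max((subset - {f} for f in subset), key=J)
    return subset


def oscillating_search(n_features: int, d: int, J: Criterion,
                       limit: int, rng: random.Random) -> Subset:
    """Basic oscillating search for a subset of d features."""
    current: Subset = frozenset(rng.sample(range(n_features), d))
    depth = 1
    while depth <= limit:
        improved = False
        if depth < d:  # down-swing: remove `depth` features, then add them back
            candidate = add_best(remove_worst(current, depth, J),
                                 n_features, depth, J)
            if J(candidate) > J(current):
                current, improved = candidate, True
        if d + depth <= n_features:  # up-swing: add `depth` features, then remove
            candidate = remove_worst(add_best(current, n_features, depth, J),
                                     depth, J)
            if J(candidate) > J(current):
                current, improved = candidate, True
        depth = 1 if improved else depth + 1
    return current
```

Since the current subset is replaced only by strictly better subsets of the same size, the loop terminates once no swing up to the depth limit yields an improvement.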



2.3 Evolutionary Algorithms for Feature Selection 

When employing EAs for FS, complete subsets are generated in parallel poten- 
tially eliminating the problems inherent to sequential FS methods. Basically, a 
start population of EA chromosomes (feature subsets) is generated randomly, 
and each individual receives a fitness according to the evaluation criterion J(-). 
The fitness determines the probability of an individual to be selected for ma- 
ting, where (usually) two individuals exchange their genetic information. The 
offspring then undergo a mutation operation and form the next generation. This 
evolutionary cycle is repeated for a user-defined number of generations, and the 
best individual (feature subset) of all generations represents the final solution. 

In the basic approach to feature selection using EAs a feature subset is encoded 
as a binary vector (Bitmap Encoding) $a = (a_1, \ldots, a_D)$, where $a_i = 1$ 
indicates the presence of the $i$-th feature in the subset, while the absence of 
the $i$-th feature is expressed by $a_i = 0$. The bitmap encoding is well suited 
for an evolutionary feature selection technique based on a wrapper approach 
(Siedlecki and Sklansky). 

In (Punch et al., 1993) the bitmap encoding was generalized to a weighted encoding, 
where features were assigned different weights resulting in a modification 
(warping) of the feature space for a k-NN classifier. In (Yang and Honavar, 1997) 
bitmap encoding has been employed for the evolutionary search of feature subsets 
for an ANN classifier. A mixture of filter and wrapper approach is presented 
in (Chaikla and Qi, 1999), where an EA with a fitness function comprising k-NN 
classifier accuracy and multiple correlation coefficients is employed for FS. 
Although all these works could report on improvements in accuracy or subset 
size, none of them compared evolutionary to conventional FS methods. 

With the filter approach and the inherent monotonicity of the B-distance measure 
we are using in this work, bitmap encoding together with a fitness function 
only assessing B-distance would simply result in convergence to the full feature 
set. A penalty or cost term could be introduced in the fitness function 
$f(a) = J(a) + p(a)$, where the function $p(a)$ depends on the number of features 
present in $a$. The penalty term $p(a)$ can then be used to favor feature subsets 
of a given cardinality. 
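
A penalised fitness of this kind could look as follows in Python. The linear penalty on the deviation from the target cardinality and the parameter names are assumptions chosen for illustration; the text does not specify the form of the penalty term.

```python
from typing import Callable, FrozenSet, Sequence


def penalised_fitness(bitmap: Sequence[int],
                      J: Callable[[FrozenSet[int]], float],
                      target_size: int,
                      penalty_weight: float) -> float:
    """f(a) = J(a) + p(a) with a (hypothetical) linear size penalty p(a) <= 0."""
    subset = frozenset(i for i, bit in enumerate(bitmap) if bit)
    penalty = -penalty_weight * abs(len(subset) - target_size)
    return J(subset) + penalty
```

With a monotone criterion the full feature set has the highest value of J, so the penalty weight has to be large enough to outweigh the gain from additional features.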

Intuitively, for the experimental framework in this work the use of a penalty 
term unnecessarily complicates the evolutionary search, as the EA has to find 
a subset size which is known from the very beginning. Moreover, as soon as 
the EA arrives at the correct size of the feature subset it will likely stay with 
this solution, because only very specific interactions of genetic operators will 
allow the transition to a different solution of the given target size. Experimental 
results confirmed these assumptions, hence we introduce a permutation encoding 
method. 



3 Permutation Encoding 

As outlined in Section 2.3, instead of the more traditional bitmap encoding for 
the generation of subsets, we experimented with variants of permutation encoding, 
which is primarily used for order problems such as the Traveling Salesperson 
Problem (TSP). The bases¹ of the EA chromosome are integers building a permutation 
as shown in Figure 1. 



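[Figure 1: for subset size = 2, the chromosome 4 1 2 3 6 5 0 encodes the bitmap 
0100100; a second chromosome (only partly legible: 6 1 4 5 2 3) encodes the 
bitmap 1000001.] 

Fig. 1. Permutation encoding for the generation of subsets. 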



For the generation of a subset of a specific target size t the chromosome is 
scanned from left to right, and the first t bases representing a feature index 
are used to construct the subset. The permutation encoding ensures that each 
feature taken into account is different, and always yields a subset of the given 
target size t. 
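
Decoding a permutation chromosome is a one-line operation, as the following Python sketch shows. The function names are illustrative, and feature indices are assumed to start at 0.

```python
from typing import List, Sequence


def decode_subset(chromosome: Sequence[int], target_size: int) -> List[int]:
    """The first target_size bases are the expressed feature indices."""
    return list(chromosome[:target_size])


def to_bitmap(chromosome: Sequence[int], target_size: int) -> str:
    """Equivalent bitmap encoding of the expressed subset."""
    selected = set(chromosome[:target_size])
    return "".join("1" if f in selected else "0" for f in range(len(chromosome)))
```

For the chromosome 4 1 2 3 6 5 0 and target size 2, the expressed features are 4 and 1, which corresponds to the bitmap 0100100.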

In order to preserve the permutation property of the chromosomes, specific 
genetic operators have been devised. The mutation operator simply exchanges 
two random bases on the chromosome with a given mutation rate Pm (usually 
in the range of 0.001 to 0.01). One of the most prominent crossover operators for 
permutation encoding is the Partially Matched Crossover (PMX) proposed in 
(Goldberg and Lingle, 1985). Its basic mechanisms are presented in Figure 2. 

¹ Many researchers use the term gene, but a wild type gene is a much more complex 
structure, and the term base is equally a synonym for an atomic information unit in 
biology. 

Fig. 2. Partially Matched Crossover (PMX). 



Generally, for crossover two parent chromosomes are selected, and crossover 
is performed according to a user-defined crossover rate Pc (usually in the range 
of 0.6 — 1.0). If no crossover occurs, the two parents are simply copied to two 
offspring chromosomes. In the crossover phase two crossover sites are selected 
randomly (sites are the same for both parents). Then, the bases in between 
the crossover sites are exchanged. Up to this point we have exactly described 
the very common 2-point Crossover, but if the bases were only exchanged, the 
permutation property would be lost. Thus, each base to be copied to the other 
parent is searched in that parent and swapped with the base currently at the 
locus, where the exchange takes place (just like a single mutation). In doing 
so the partial order between the crossover sites can be exchanged between the 
parents without corrupting the permutation property. 
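
The swap-based exchange described above can be written compactly in Python. The sketch below is an illustration under stated assumptions: the crossover sites are passed in explicitly as the half-open range from `lo` to `hi`, the mutation operator applies the mutation rate per locus, and all names are illustrative rather than taken from the software used in the experiments.

```python
import random
from typing import List, Sequence, Tuple


def pmx(parent_a: Sequence[int], parent_b: Sequence[int],
        lo: int, hi: int) -> Tuple[List[int], List[int]]:
    """Partially matched crossover between loci lo (inclusive) and hi (exclusive)."""
    def build(receiver: Sequence[int], donor: Sequence[int]) -> List[int]:
        child = list(receiver)
        for locus in range(lo, hi):
            wanted = donor[locus]          # base to be copied from the other parent
            source = child.index(wanted)   # where that base currently sits
            # Swap instead of overwrite, so the child stays a permutation.
            child[locus], child[source] = child[source], child[locus]
        return child
    return build(parent_a, parent_b), build(parent_b, parent_a)


def swap_mutation(chromosome: Sequence[int], pm: float,
                  rng: random.Random) -> List[int]:
    """With probability pm per locus, exchange that base with a random one."""
    mutant = list(chromosome)
    for locus in range(len(mutant)):
        if rng.random() < pm:
            other = rng.randrange(len(mutant))
            mutant[locus], mutant[other] = mutant[other], mutant[locus]
    return mutant
```

As a usage example, crossing 4 1 2 3 6 5 0 with 6 0 1 4 5 2 3 at loci 3 and 4 moves base 4 out of the first two positions of the first child, so for a target size of 2 the expressed subset changes from {4, 1} to {3, 1} although both crossover sites lie in the unexpressed region.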

As can be seen in Figure 1, parts of the chromosome are not expressed. If 
both crossover sites fall into this region, it might appear that crossover does 
not change the expressed feature subset. But assuming an exemplary target 
subset size t = 2 in Figure 2, it can be observed that with PMX the features in 
the subset can be altered even under this condition. Although a mutation in an 
unexpressed region of the chromosome does not affect the encoded subset, it can 
indirectly exert an influence by means of a subsequent PMX crossover. 



4 Benchmark Problems 

A brief description of the data sets used for FS experiments is given below. 
The main criterion for selecting these specific benchmarks is the rather high 
number of features challenging the search capabilities of the FS algorithms to 
be compared. 

Breast This is the Wisconsin Diagnostic Breast Cancer database 
containing 284 examples. From 30 features a prediction into the classes malignant 
and benign is sought (from the UCI Machine Learning Repository) 
(Blake and Merz, 1998). 

Sonar This data set is used to classify sonar signals into signals bounced off 
a metal cylinder and those bounced off a roughly cylindrical rock (Gorman 
and Sejnowski, 1988). It has 105 examples with 60 continuous inputs (from 
the CMU neural networks benchmark collection). 

Mammo This is a mammogram data set from the Pattern Recognition and 
Image Modeling Laboratory at the University of California, Irvine, containing 
86 examples. From 65 features a prediction into the classes malignant 
and benign is sought (from the UCI Machine Learning Repository) 
(Blake and Merz, 1998). 



4.1 Experimental Setup 

Experiments have been run for the three benchmark data sets using the three 
FS algorithms to be compared. For each problem all subset sizes from t = 1 to 
t = n - 1 (with n being the total number of features) have been investigated. 
The following parameters have been used with all the experiments in this paper: 

SFFS Parameters: Initial subset size = 2 (generated by SFS), Runs = 1 (deterministic 
behavior). 

OS Parameters: Δ = 50% (of the number of features), Runs = 20. 

EA Parameters: Population Size = 50, Generations = 100, Crossover Probability 
Pc = 0.6, Mutation Probability Pm = 0.01, Crossover = PMX, Selection 
Method = Binary Tournament, Runs = 20 (the EA parameters are fairly standard 
and are not based on extensive experiments). 

5 Experimental Results 

For the Breast data set containing 30 features, we were able to compute the 
optimal subset of each cardinality by means of a yet unpublished Branch-and-Bound 
method. Thus, Figure 3 shows the differences (error) of the compared 
algorithms to the optimal subsets. 

It can be observed that OS yields the smallest mean errors, followed by SFFS 
and EA. When looking at the best results of OS and EA, the optimal result 
was always found within the 20 runs spent. As SFFS is deterministic, its best 
results are identical to the mean results shown in Figure 3 (left). The relatively 
large mean error for EA at t = 2 might be a side effect of permutation 
encoding, as most mutations fall in an unexpressed region of the chromosome, 
and the population quickly converges to a local optimum. However, with bitmap 
encoding it is difficult to find any solution at all with subset size t = 2. 

In terms of computational cost (Figure 3 (right)) OS is the most expensive, 
but also delivers the best results. However, an increase in the number of generations 
of the EA would further improve its solutions. 
Fig. 3. Breast - Mean errors (left) and mean number of evaluations to find the best
B-Distance (right) of Sequential Floating Forward Search (SFFS), Oscillating Search 
(OS), and Evolutionary Algorithm (EA) (averaged on 20 runs). 



The best results of the compared FS algorithms and the corresponding
computational cost for the Sonar data set are depicted in Figure 4.
Fig. 4. Sonar - Best B-distances (left) and mean number of evaluations to find the
best B-Distance (right) of Sequential Floating Forward Search (SFFS), Oscillating 
Search (OS), and Evolutionary Algorithm (EA) (averaged on 20 runs). 



Although the best results found seem to be very similar in Figure 4 (left),
a closer look at the numbers reveals the same order of performance as for the
other data sets. The sudden decrease of the B-distance^ is a strong indicator
that some of the features in the data set are linearly dependent. Obviously,
EA exhibits these problems earlier, which remains to be studied, but a possible
explanation is the more “intelligent” inclusion and exclusion of features with OS,
whereas EA performs a blind search.

Very similar things can be said about the results for the Mammo data set
shown in Figure 5.

Again, OS and EA deliver the best results, but starting with a subset size 
around t = 30 EA marginally drops behind SFFS. A main reason for that be- 
havior might be the increased complexity of that problem (65 features), while 
keeping the number of generations of the EA fixed at 100, which even compared 
to SFFS results in a much lower number of subset evaluations. Accordingly, the 

^ The numerical value of 0.0 is arbitrary and is used by our software to indicate 
impossible matrix operations. 
Fig. 5. Mammo - Best B-distances (left) and mean number of evaluations to find
the best B-Distance (right) of Sequential Floating Forward Search (SFFS), Oscillating 
Search (OS), and Evolutionary Algorithm (EA) (averaged on 20 runs). 



computational cost (Figure 5 (right)) clearly confirms the well-established fact
that EAs find good solutions in a short time (but are less effective in honing
these solutions).

6 Outlook 

The results presented in this paper are quite encouraging considering that SFFS
has been evaluated as one of the best available FS algorithms in (Jain and
Zongker, 1997). Not only do OS and EA generate better results than SFFS, but
EA also requires comparable, if not less, computation time. Clearly, these results
have to be confirmed with a number of additional data sets, and the algorithms
should also be investigated in the environment of a wrapper approach employing
a number of different classifiers. A very promising future research direction could
be the hybridization of EA and OS, combining the speed of an EA in finding a good
solution with the ability of the OS to improve an existing feature subset.

7 Acknowledgements 

We would like to thank the executives of the Aktion board for their very friendly 
and motivating support of this inter-European research initiative. Many thanks 
to our project colleagues Jiri Grim and Roland Schwaiger for their ideas having 
already initiated new lines of research within our cooperation. We also thank 
Markus Amersdorfer, Martin Angerer, Sandor Herramhof, and Harald Schweiger 
for development of a Java prototype for evolutionary feature selection in their 
software engineering course. 



H.A. Mayer et al. 



References 

[Blake and Merz, 1998]Blake, C. and Merz, C. (1998). 

http://www.ics.uci.edu/~mlearn/mlrepository.html. WWW Repository,
University of California, Irvine, Dept. of Information and Computer Sciences.

[Chaikla and Qi, 1999]Chaikla, N. and Qi, Y. (1999). Genetic Algorithms in Feature 
Selection. In IEEE International Conference on Systems, Man, and Cybernetics, 
pages V - 538-540. IEEE. 

[Devijver and Kittler, 1982]Devijver, P. and Kittler, J. (1982). Pattern Recognition: A 
Statistical Approach. Prentice. 

[Fukunaga, 1990]Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. 
Academic Press. 

[Goldberg and Lingle, 1985]Goldberg, D. E. and Lingle, R. (1985). Alleles, Loci, and 
the Traveling Salesman Problem. In Grefenstette, J. J., editor. Proceedings of 
the First International Conference on Genetic Algorithms and their Applications, 
pages 154-159. Texas Instruments, Inc. and Naval Research Laboratory, Lawrence 
Erlbaum Associates. 

[Jain and Zongker, 1997]Jain, A. and Zongker, D. (1997). Feature Selection: Evalua- 
tion, Application and Small Sample Performance. IEEE Transactions on PAMI, 
19(2):153-158. 

[John et al., 1994]John, G., Kohavi, R., and Pfleger, K. (1994). Irrelevant Features and
the Subset Selection Problem. In Proceedings of the Eleventh International
Conference on Machine Learning, pages 121-129, San Mateo, CA. Morgan Kaufmann.

[Kailath, 1967]Kailath, T. (1967). The divergence and Bhattacharyya distance
measures in signal selection. IEEE Transactions on Communications Technology,
15(1):52-60.

[Mitchell, 1996]Mitchell, M. (1996). An Introduction to Genetic Algorithms. Complex
Adaptive Systems. MIT Press, Cambridge, MA.

[Pudil et al., 1994]Pudil, P., Novovicova, J., and Kittler, J. (1994). Floating search 
methods in feature selection. Pattern Recognition Letters, 15:1119-1125. 

[Punch et al., 1993]Punch, W. F., Goodman, E. D., Pei, M., Chia-Shun, L., Hovland,
P., and Enbody, R. (1993). Further Research on Feature Selection and Classification
Using Genetic Algorithms. In Forrest, S., editor. Fifth International Conference on
Genetic Algorithms, pages 557-564, San Mateo, CA. Morgan Kaufmann.

[Schwefel, 1995]Schwefel, H.-P. (1995). Evolution and Optimum Seeking. Sixth- 
Generation Computer Technology Series. Wiley, New York. 

[Siedlecki and Sklansky, 1988]Siedlecki, W. and Sklansky, J. (1988). On automatic 
feature selection. International Journal of Pattern Recognition and Artificial Intel- 
ligence, 2(2):197-220. 

[Siedlecki and Sklansky, 1989] Siedlecki, W. and Sklansky, J. (1989). A Note on Genetic 
Algorithms for Large-Scale Feature Selection. Pattern Recognition Letters, 10:335- 
347. 

[Somol and Pudil, 2000]Somol, P. and Pudil, P. (2000). Oscillating Search Algorithms 
for Feature Selection. In Submission to the 15th International Conference on Pat- 
tern Recognition, Barcelona. 

[Yang and Honavar, 1997]Yang, J. and Honavar, V. (1997). Feature Subset Selection 
Using a Genetic Algorithm. In Genetic Programming, pages 380-385. 




Selection of Classifiers Based on Multiple Classifier Behaviour



Giorgio Giacinto, Fabio Roli, and Giorgio Fumera 

Dept. of Electrical and Electronic Eng. - University of Cagliari
Piazza d’Armi, 09123 Cagliari, ITALY

Phone +39-070-6755862 Fax +39-070-6755900
e-mails {giacinto, roli, fumera}@diee.unica.it



Abstract. In the field of pattern recognition, the concept of Multiple Classifier 
Systems (MCSs) was proposed as a method for the development of high 
performance classification systems. At present, the common “operation” 
mechanism of MCSs is the “combination” of classifier outputs. Recently,
some researchers pointed out the potential of “dynamic classifier selection”
(DCS) as a new operation mechanism. In this paper, a DCS algorithm based on
the MCS behaviour is presented. The proposed method aims to exploit the
behaviour of the MCS in order to select, for each test pattern, the classifier that
is most likely to provide the correct classification. Reported results on the
classification of different data sets show that dynamic classifier selection based 
on MCS behaviour is an effective operation mechanism for MCSs. 

Keywords: Multiple Classifier Systems, Combination of Classifiers, Dynamic 
Classifier Selection, Image Classification 



1. Introduction 

In the field of pattern recognition, a number of multiple classifier systems (MCSs) 
based on the combination of outputs of a set of different classifiers have been 
proposed [1]. For each pattern, the classification process is performed in parallel by 
different classifiers and the results are then combined. Many combination methods, 
e.g., voting, Bayesian and Dempster-Shafer approaches, are based on "decision 
fusion" techniques that combine the classifications provided by different classifiers 
[1]. As an example, the “majority” voting rule interprets each classification result as a 
"vote" for one of the data classes and assigns the input pattern to the class receiving 
the majority of votes. Such methods are able to improve the classification accuracy of 
individual classifiers under the assumption that different classifiers make 
“independent” errors. However, in real pattern recognition applications, it is usually 
difficult to design a set of classifiers that exhibit an independent behaviour on the 
whole feature space. In order to avoid the independence assumption, Huang and Suen 
proposed a combination method, named "Behaviour Knowledge Space" (BKS), that 
exploits the behaviour of the MCS [2]. The behaviour of the MCS for each training 
pattern is recorded as a vector whose elements are the decisions of the classifiers of 
the MCS. For each unknown test pattern the MCS behaviour is considered, and the 

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 87-93, 2000.
training patterns that exhibit the same MCS behaviour are identified. The unknown
pattern is then assigned to the class most represented among such training patterns. 
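
The BKS rule just described can be sketched in a few lines. This is only an illustrative sketch: the function names `build_bks` and `bks_classify` and the toy data are assumptions made here, not part of Huang and Suen's method description, and the handling of unseen behaviour vectors is deliberately left open.

```python
from collections import Counter, defaultdict


def build_bks(train_decisions, train_labels):
    """Map each behaviour vector (tuple of classifier decisions) to class counts."""
    table = defaultdict(Counter)
    for decisions, label in zip(train_decisions, train_labels):
        table[tuple(decisions)][label] += 1
    return table


def bks_classify(table, decisions):
    """Return the class most represented among training patterns
    that showed the same behaviour vector.

    Returns None when the behaviour was never observed in training
    (how to handle that case is left open in this sketch).
    """
    counts = table.get(tuple(decisions))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical decisions of K = 3 classifiers on five training patterns.
train_decisions = [(1, 1, 2), (1, 1, 2), (1, 1, 2), (2, 2, 2), (1, 2, 2)]
train_labels = [2, 2, 1, 2, 1]
table = build_bks(train_decisions, train_labels)
print(bks_classify(table, (1, 1, 2)))  # class 2 is the most frequent for this behaviour
```

Note that the combined decision (class 2) differs from what a majority vote over the behaviour vector (1, 1, 2) would give, which is exactly the kind of case BKS is meant to handle.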

In this paper, the MCS behaviour is exploited in order to perform a “dynamic
classifier selection” (DCS) aimed at selecting, for each unknown pattern, the classifier
that is most likely to classify it correctly [3-5]. The rationale behind this procedure
can be explained by observing that it is easy to design an MCS where, for each 
pattern, there is at least one classifier that classifies it correctly. In order to select this 
classifier, the training patterns with the same MCS behaviour are considered and the 
classifier with the highest accuracy is chosen. In Section 2, the concept of MCS 
behaviour is defined and a selection function is presented. Experimental results and 
comparisons are reported in Section 3. 



2. Dynamic Classifier Selection Based on MCS Behaviour 



2.1 Problem Definition 

Let us consider a classification task for M data classes ω1,.., ωM. Each class is
assumed to represent a set of specific patterns, each pattern being characterized by a
feature vector X. Let us also assume that K different classifiers, Cj, j = 1,.., K, have
been trained separately to solve the image classification task at hand. Let Cj(X) ∈
{1,.., M} indicate the class label assigned to pattern X by classifier Cj.



2.2 Multiple Classifier Behaviour 

For each test pattern X*, a vector made up of K elements Cj(X*) is available. Let
us indicate with MCB(X*) = {C1(X*), C2(X*),.., CK(X*)} the vector that represents
the "Multiple Classifier Behaviour" (MCB) for pattern X*. MCB(X*) represents the
behaviour of the set of classifiers for the considered pattern.

It is worth noting that the Behaviour Knowledge Space proposed by Huang
and Suen also tries to exploit the information contained in the behaviour vector [2].
However, while the goal of BKS is to combine the results of different classifiers, the
proposed method aims to select the classifier out of K that is best able to
correctly classify the pattern X*. To this end, let us consider the subset of the training
patterns with the same MCB(X) as the test pattern X*. In other words, we are
considering all the training patterns X that satisfy the condition Cj(X) = Cj(X*), ∀j =
1,.., K. Let us indicate this subset of the training patterns with S(X*). The goal of our
procedure is to select the most accurate classifier out of K by taking into account the
behaviour of the patterns in S(X*). To this end, for each classifier, the classification
accuracy related to the patterns in S(X*) is computed. As an example, such
classification accuracy can be obtained as the fraction of correctly classified patterns
belonging to S(X*). The K classifiers are then ranked according to the measure of
classification accuracy and the one with the highest accuracy is then chosen to
classify the unknown pattern X*.

If a given MCB(X*) is not exhibited by any training patterns, then S(X*) can be 
made up of the subset of training patterns whose MCB(X) differ from MCB(X*) for a 
limited number of elements c < K. Thus S(X*) can contain patterns whose MCS 
behaviour is similar to the one exhibited by the unknown pattern. 

The above procedure is based on the assumption that, for all the patterns X that 
exhibit the same MCS behaviour, there exists at least one classifier that is able to 
classify them correctly. In other words, let us consider each classifier in the MCS as 
an "expert". When the experts disagree, we consider all the known cases where the 
experts exhibited the same disagreement and we select the expert who exhibits the
highest accuracy for such cases.
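
As a rough sketch of this selection idea, the code below builds S(X*) from the training behaviour vectors and picks the classifier with the highest fraction of correct decisions on it. The names `hamming` and `select_classifier`, the parameter `c` for the relaxed match, and the toy data are assumptions introduced for illustration; the simple accuracy fraction is only one of the possible accuracy measures.

```python
def hamming(a, b):
    """Number of positions at which two behaviour vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def select_classifier(train_mcb, train_labels, test_mcb, c=0):
    """Select the classifier with the highest accuracy on S(X*).

    S(X*) holds the training patterns whose behaviour vector equals MCB(X*);
    if none exists, patterns differing in at most c elements are used instead.
    Returns (classifier index, its decision for X*), or None if S(X*) stays empty.
    """
    test_mcb = tuple(test_mcb)
    pairs = list(zip(train_mcb, train_labels))
    subset = [(m, y) for m, y in pairs if hamming(m, test_mcb) == 0]
    if not subset and c > 0:
        subset = [(m, y) for m, y in pairs if hamming(m, test_mcb) <= c]
    if not subset:
        return None
    k = len(test_mcb)
    # Fraction of patterns in S(X*) that classifier j labelled correctly.
    accuracy = [sum(m[j] == y for m, y in subset) / len(subset) for j in range(k)]
    best = max(range(k), key=accuracy.__getitem__)
    return best, test_mcb[best]


# Hypothetical behaviour vectors of K = 3 classifiers and true labels.
train_mcb = [(1, 2, 2), (1, 2, 2), (1, 2, 2), (1, 1, 1)]
train_labels = [1, 1, 2, 1]
print(select_classifier(train_mcb, train_labels, (1, 2, 2)))  # (0, 1)
```

In the toy data the first classifier is right on two of the three patterns in S(X*), so it is selected even though the other two classifiers agree with each other.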



2.3 A Measure of Classifier Accuracy 



In this section, a measure of classifier accuracy that takes into account the
uncertainties in the classification process is proposed. Let us assume that the
classifier Cj assigns the test pattern X* to the data class ωi. We indicate this by
Cj(X*) = i. It is easy to see that the accuracy of classifier Cj in S(X*) can be estimated
as the fraction of patterns belonging to S(X*) assigned to class ωi by Cj that have
been correctly classified. However, if the classifier provides estimates of the class
posterior probabilities, we propose to take these probabilities into account in order to
improve the estimation of the above measure of classifier accuracy (CA). Given a
pattern X ∈ ωi, i = 1,.., M, belonging to S(X*), the posterior probability Pj(ωi | X)
provided by the classifier
Cj can be regarded as a measure of the classifier accuracy for the pattern X. CA can
then be estimated by computing the probability that the test pattern X* is correctly
assigned to class ωi by the classifier Cj. According to the Bayes theorem, this
probability can be estimated as follows:




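$$P(X^* \in \omega_i \mid C_j(X^*) = i) = \frac{P(C_j(X^*) = i \mid X^* \in \omega_i)\, P(\omega_i)}{\sum_{m=1}^{M} P(C_j(X^*) = i \mid X^* \in \omega_m)\, P(\omega_m)} \tag{1}$$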



where P(Cj(X*) = i | X* ∈ ωi) is the probability that the classifier Cj classifies the
patterns belonging to class ωi correctly. This probability can be estimated by
averaging the posterior probabilities Pj(ωi | Xn ∈ ωi) provided by the classifier Cj on
the training patterns Xn in S(X*) that belong to the class ωi. In other words, if Ni is the
number of patterns in S(X*) that belong to the class ωi, then



$$P(C_j(X^*) = i \mid X^* \in \omega_i) = \frac{1}{N_i} \sum_{X_n \in \omega_i} P_j(\omega_i \mid X_n \in \omega_i) \tag{2}$$
The prior probabilities P(ωi) can be estimated as the fraction of patterns in S(X*)
that belong to class ωi. If we let N be the total number of patterns belonging to
S(X*), then



$$P(\omega_i) = \frac{N_i}{N} \tag{3}$$



Therefore, substituting equations (2) and (3) in equation (1) the following estimate 
of CA for classifier Cj is obtained: 



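$$CA_j(X^*) = \frac{\sum_{X_n \in \omega_i} P_j(\omega_i \mid X_n)\, W_n}{\sum_{m=1}^{M} \sum_{X_n \in \omega_m} P_j(\omega_i \mid X_n)\, W_n} \tag{4}$$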



where, in order to handle the “uncertainty” in the size of S(X*), the class posterior
probabilities can be “weighted” by a term Wn = 1/dn, where dn is the Euclidean
distance of the pattern Xn belonging to S(X*) from the test pattern X*.
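
A minimal sketch of this CA estimate follows, assuming that equation (4) is the ratio of distance-weighted posteriors for the predicted class over patterns of that class to the same sum over all patterns in S(X*). The function name `classifier_accuracy`, the 0-based class indices, the guard against zero distances and the toy numbers are all assumptions made for illustration.

```python
import math


def classifier_accuracy(posteriors, labels, patterns, test_pattern, predicted_class):
    """Estimate CA_j(X*) for one classifier C_j over the patterns in S(X*).

    posteriors[n][m] : estimate of P_j(omega_m | X_n) given by C_j for pattern X_n
    labels[n]        : true class index of X_n (0-based in this sketch)
    patterns[n]      : feature vector of X_n
    test_pattern     : feature vector of X*
    predicted_class  : index i of the class assigned to X* by C_j
    """
    numerator = 0.0
    denominator = 0.0
    for post, label, x in zip(posteriors, labels, patterns):
        distance = math.dist(x, test_pattern)
        weight = 1.0 / max(distance, 1e-12)  # assumption: avoid division by zero
        term = post[predicted_class] * weight
        denominator += term  # all patterns in S(X*), whatever their class
        if label == predicted_class:
            numerator += term  # only patterns that truly belong to class i
    return numerator / denominator if denominator > 0 else 0.0


# Three hypothetical patterns in S(X*), all at distance 1 from X* = (0, 0).
patterns = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
labels = [0, 0, 1]
posteriors = [[0.8, 0.2], [0.6, 0.4], [0.4, 0.6]]
ca = classifier_accuracy(posteriors, labels, patterns, (0.0, 0.0), predicted_class=0)
print(round(ca, 4))  # 1.4 / 1.8
```

With equal distances the weights cancel, so the value reduces to (0.8 + 0.6) / (0.8 + 0.6 + 0.4).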



2.4 An Algorithm for DCS Based on MCS Behaviour

In the following, a dynamic classifier selection algorithm is described. 

Input parameters: test pattern X*, MCB(X) for the training data, the rejection 
threshold value, and the selection threshold value 
Output: classification of the test pattern X* 

STEP 1: Compute MCB(X*). If all the classifiers assign X* to the same data class, 
then the pattern is assigned to this class. 

STEP 2: Identify the training patterns whose MCB(X) = MCB(X*). 

STEP 3: Compute CAj(X*), j = 1,.., K

STEP 4: If CAj(X*) < rejection-threshold Then Disregard classifier Cj

STEP 5: Identify the classifier Cm exhibiting the maximum value of CAj(X*)

STEP 6: For each classifier Cj, compute the following differences

dj = [CAm(X*) - CAj(X*)]

STEP 7: If ∀j, j ≠ m, dj > selection-threshold Then Select Classifier Cm

Else Select randomly one of the classifiers for which dj < selection-threshold

Step 2 identifies the training patterns that make up S(X*). If this set is empty, the
training patterns with a MCB(X) that differs from MCB(X*) for c < K elements can 
be included in S(X*). 

Step 4 is aimed at excluding from the selection process the classifiers that exhibit 
CA values smaller than the given rejection threshold. 

Step 6 computes the differences dj in order to evaluate the “reliability” of the
selection of the classifier Cm. If all the differences are higher than the given
selection-threshold, then it is reasonably "reliable" that classifier Cm should correctly
classify the test pattern X*. Otherwise, a random selection is performed among the
classifiers for which dj < selection-threshold. Alternatively, random selection can be
substituted by the combination of these classifiers.
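
The thresholding logic of Steps 1 and 4 to 7 can be sketched as follows, taking the CA values of Step 3 as given. The function name `dcs_select`, the fallback used when every classifier is rejected, the inclusion of the best classifier in the random draw (its own difference is zero) and the example thresholds are assumptions of this sketch rather than details stated in the algorithm.

```python
import random


def dcs_select(decisions, ca, rejection_threshold, selection_threshold, rng=random):
    """Return the index of the classifier chosen for the test pattern X*.

    decisions : the K class labels forming MCB(X*)
    ca        : the K values CA_j(X*) from Step 3
    """
    # STEP 1: if all classifiers agree, any of them gives the final class.
    if len(set(decisions)) == 1:
        return 0
    # STEP 4: disregard classifiers below the rejection threshold.
    kept = [j for j in range(len(ca)) if ca[j] >= rejection_threshold]
    if not kept:  # assumption: keep all classifiers if every one is rejected
        kept = list(range(len(ca)))
    # STEP 5: classifier with the maximum CA value.
    m = max(kept, key=lambda j: ca[j])
    # STEP 6: differences between the best CA value and the others.
    diffs = {j: ca[m] - ca[j] for j in kept if j != m}
    # STEP 7: select C_m if it is clearly better, otherwise draw at random
    # among the classifiers that are too close to call (C_m included).
    close = [j for j, d in diffs.items() if d < selection_threshold]
    if not close:
        return m
    return rng.choice(close + [m])


print(dcs_select([1, 2, 2], [0.9, 0.5, 0.3], 0.4, 0.1))  # 0
```

In the example the third classifier is rejected (0.3 < 0.4) and the first one leads the second by 0.4, well above the selection threshold, so it is selected without any random draw.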



3. Experimental Results 

Experiments have been carried out using three data sets contained in the ELENA 
(Enhanced Learning for Evolutive Neural Architecture) data base. In particular, we 
used the following data sets: phoneme_CR (French phoneme data), satimage_CR 
(remote sensing images acquired by the LANDSAT satellite), and texture_CR 
(images of the Brodatz’s textures). Further details on these data sets can be found via 
anonymous ftp at ftp.dice.ucl.ac.be in the directory pub/neural- 
nets/ELENA/databases . In our experiments, we used the same data classes, features, 
and numbers of training and test patterns used in [4] . 

A set made up of five different classifiers was used (Table 1): the k nearest 
neighbours classifier, the multilayer perceptron (MLP) neural network, the C4.5 
decision tree [6], the quadratic Bayes classifier (QB) and the linear Bayes classifier 
(LB). For the sake of brevity, we refer the reader interested in more details on the 
design of these classifiers to [4]. Tables 1 and 2 show the percentage accuracies of the 
individual classifiers for the data sets used. We randomly partitioned each data set 
into two equal partitions, keeping the class distributions similar to that of the full data 
set. Each partition was firstly used as training set and then as test set. In Table 1, the 
accuracies for each trial are reported, while in Table 2 the accuracies are reported as 
the average of the two results. 



Table 1. Percentage accuracies provided by the five classifiers applied to the ELENA data 
sets. Results obtained on each of the two partitions of the data sets are reported. 





                 Phoneme            Satimage           Texture
Classifier   Trial 1  Trial 2   Trial 1  Trial 2   Trial 1  Trial 2
k-nn          86.38    89.16     88.11    87.06     97.75    97.75
MLP           86.79    85.79     85.62    82.77     98.51    98.51
C4.5          83.72    85.83     85.78    85.54     91.38    90.51
QB            78.91    78.42     85.93    85.48     98.87    99.20
LB            77.13    75.42     83.35    81.97     97.56    97.27



Table 2. Average percentage accuracies provided by the five classifiers applied to the ELENA 
data sets. 



Classifier   Phoneme   Satimage   Texture
k-nn         87.77%    87.59%     97.75%
MLP          86.29%    84.20%     98.51%
C4.5         84.78%    85.66%     90.95%
QB           75.41%    85.78%     99.04%
LB           73.00%    83.31%     97.42%
Tables 3 and 4 show the performances of the proposed selection method (DCS-
MCB) and the performances of the combination method based on the majority voting
rule. For comparison purposes, the performances of the best individual classifier and
the “oracle” are also shown. The "oracle" is the ideal selector which always chooses
the classifier providing the correct classification if any of the individual classifiers
does so.
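
To make the two reference points concrete, the sketch below computes the oracle accuracy and the majority-vote accuracy from a list of per-pattern classifier decisions. The function names and the four toy patterns are assumptions for illustration only and are unrelated to the ELENA results.

```python
from collections import Counter


def oracle_accuracy(all_decisions, labels):
    """Percentage of patterns for which at least one classifier is correct."""
    hits = sum(any(d == y for d in decisions)
               for decisions, y in zip(all_decisions, labels))
    return 100.0 * hits / len(labels)


def majority_accuracy(all_decisions, labels):
    """Percentage of patterns correctly classified by the majority voting rule."""
    hits = sum(Counter(decisions).most_common(1)[0][0] == y
               for decisions, y in zip(all_decisions, labels))
    return 100.0 * hits / len(labels)


# Hypothetical decisions of three classifiers on four test patterns.
all_decisions = [(1, 1, 2), (2, 2, 1), (3, 2, 3), (3, 3, 3)]
labels = [1, 1, 3, 1]
print(oracle_accuracy(all_decisions, labels))    # 75.0
print(majority_accuracy(all_decisions, labels))  # 50.0
```

The gap between the two figures is the room that a selection-based MCS can, at best, exploit: on the second pattern the majority is wrong although one classifier is right.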

Table 3. Percentage accuracies provided by the proposed DCS method (DCS-MCB), the 
combination by majority voting rule, the best classifier of the ensemble, and the oracle. Results 
obtained on each of the two partitions of the data sets are reported. 





                     Phoneme            Satimage           Texture
Classifier       Trial 1  Trial 2   Trial 1  Trial 2   Trial 1  Trial 2
Oracle            97.52    97.08     95.99    95.71     99.93    99.93
Best classifier   86.79    89.16     88.11    87.06     98.87    99.20
DCS-MCB           87.75    93.34     88.39    89.49     98.89    99.67
Majority rule     86.16    92.23     88.31    90.32     99.24    99.20



The DCS-MCB method always outperformed the best classifier of the ensemble,
indicating that dynamic classifier selection is a method for improving the
accuracies of individual classifiers. Accuracies provided by combination-based MCSs
are sometimes better than the ones of selection-based MCSs. This result is very
reasonable, as very “different”, and therefore very “independent”, classifiers were
used in these experiments. However, our method outperformed the majority rule
combination method in most of the experiments.

Table 4. Average percentage accuracies provided by the proposed DCS method (DCS-MCB), 
the combination by majority voting rule, the best classifier of the ensemble, and the oracle. 



Classifier        Phoneme   Satimage   Texture
Oracle            97.30     95.85      99.93
Best classifier   87.77     87.59      99.04
DCS-MCB           90.55     88.94      99.28
Majority rule     89.20     89.32      99.22



4. Conclusions 

In this paper, we have addressed the “open” research topic of selection-based 
MCSs. In particular, we presented a dynamic classifier selection method aimed at 
selecting, for each unknown pattern, the most accurate classifier of the MCS on the 
basis of the MCS behaviour on the unknown test pattern. 

Reported results showed that dynamic classifier selection based on MCS behaviour
always outperforms the best classifier in the ensemble. In addition, our selector
exhibited performances that are close to or better than the ones exhibited by the
majority voting combination.
Abstract. In this paper, a mixture-of-subspaces model is proposed to
describe images. Images or image patches, when translated, rotated or 
scaled, lie in low-dimensional subspaces of the high-dimensional space 
spanned by the grey values. These manifolds can locally be approxima- 
ted by a linear subspace. The adaptive subspace map is a method to 
learn such a mixture-of-subspaces from the data. Due to its general na- 
ture, various clustering and subspace-finding algorithms can be used. If 
the adaptive subspace map is trained on data extracted from images, 
a description of the image content is obtained, which can then be used 
for various classification and clustering problems. Here, the method is 
applied to an image database retrieval problem and an object image 
classification problem, and is shown to give promising results.



1 Introduction 

A method often used for representing image data is to use the image function
on a grid of pixel positions, i.e. I(x, y). A problem with this representation is
that it is rather unnatural in a number of ways. In principle, although the space
spanned by all pixel positions (x, y) is high-dimensional, it would seem to lend
itself well for use in standard pattern recognition approaches. However, if the
image is just slightly translated, rotated or scaled, the distribution of the data in 
the high-dimensional space changes completely, whereas to a human observer the 
data still looks very similar. In fact, transformed versions of an image all lie on 
an m-dimensional manifold in the high-dimensional space spanned by all pixel 
grey values, where m is the number of degrees of freedom present in the set of 
transformations 0. Although this manifold may be low-dimensional, it is likely 
to fill a large portion of the high-dimensional space. The approach proposed 
here is to describe an image, or image patches, using subspaces in the high- 
dimensional grey value space rather than just points. A mixture of subspaces 
can then be used to model the manifold discussed earlier. It draws on earlier
work by Kohonen, the adaptive subspace self-organising map (ASSOM) 0 and
work by Hinton et al. Pj- Their approaches will be discussed in Sect. 2.

In Sect. 3 the basic version of the proposed method, the adaptive subspace
map, or ASM, will be introduced. Next, in Sect. 4, it is applied to an image
database retrieval problem. In this problem, an ASM is trained on each image.
Then, a distance measure between the ASMs is used to retrieve images.

Sect. 5 discusses a slightly different problem, that of object recognition. Here
ASMs are trained on a collection of images representing a certain class, and
histograms of an image mapped onto these ASMs are used to classify it. Finally,
Sect. 6 will draw some conclusions and give ideas for further work.

2 Adaptive Image Description 

Kohonen jSj proposed an extension of his self-organising map, which uses sub- 
spaces Sj in each node rather than just single weights. This adaptive subspace 
self-organising map, or ASSOM, is based on training not just using single sam- 
ples but sets Ek of slightly translated, rotated and/or scaled signal or image 
samples, called episodes. These episodes are treated as a single entity, that is, 
samples are assigned as a group to a subspace based on a distance measure
between an episode and a subspace, which is the minimum projection error of any
sample x ∈ Ek:

$$D(E_k, S_j) = \min_{x \in E_k} \|x - \hat{x}\| , \tag{1}$$

where $\hat{x}$ is the projection of x onto subspace Sj. To train the ASSOM, samples
drawn from a signal or an image are converted into episodes by creating slightly 
transformed versions of the original sample. The distance between each node 
and the episode is then calculated, and the winning node is defined as that 
node to which the episode has minimum distance. In the adaptation phase, the 
winning node’s subspace, and that of its neighbours, is rotated to better fit the 
just presented episode. 
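
The episode-to-subspace distance and the choice of the winning node can be sketched as below. This is only an illustration: it assumes that each subspace is given by an orthonormal basis passing through the origin, and the names `projection_error`, `episode_distance` and `winning_node` are introduced here, not taken from Kohonen's formulation. The rotation-based adaptation step is not shown.

```python
import math


def projection_error(x, basis):
    """Norm of x minus its projection onto span(basis).

    The basis vectors are assumed to be orthonormal.
    """
    projection = [0.0] * len(x)
    for b in basis:
        coeff = sum(xi * bi for xi, bi in zip(x, b))
        projection = [p + coeff * bi for p, bi in zip(projection, b)]
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, projection)))


def episode_distance(episode, basis):
    """Minimum projection error over all samples of the episode."""
    return min(projection_error(x, basis) for x in episode)


def winning_node(episode, bases):
    """Index of the node whose subspace is closest to the episode."""
    return min(range(len(bases)), key=lambda j: episode_distance(episode, bases[j]))


# Two hypothetical nodes in a 3-dimensional space: the xy-plane and the z-axis.
bases = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [(0.0, 0.0, 1.0)]]
episode = [(1.0, 2.0, 3.0), (0.0, 1.0, 0.5)]
print(episode_distance(episode, bases[0]))  # 0.5
print(winning_node(episode, bases))         # 0
```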

The ASSOM gives good results, but is extremely slow in training. This is cau- 
sed by the learning of the subspaces by rotating them, which demands careful 
and prudent setting of learning parameters, but also by the updating of neigh- 
bourhoods to obtain a topologically correct map. If one drops these demands, 
i.e. just finds the subspaces in a batch-mode operation (e.g. using principal com- 
ponent analysis (PCA)), and performs non-topologically correct clustering, the 
resulting system would be greatly simplified. Such a system would come close 
to the system described by Hinton et al. jS), a mixture of local linear models. 
However, their method does not use the idea of episodes, and was mainly used 
on small images containing entire objects (handwritten digits). 

In this paper a system is described which combines the best aspects of these 
two approaches. On the one hand, knowledge of the invariances one desires the 
models to display can be used when creating the training set, as was done by 
Kohonen. On the other hand, to avoid the extremely long training times of the 
ASSOM, a mixture of local subspaces is used. An overview of our proposed
system will be given in Sect. 3.

3 The Adaptive Subspace Map 

The basic idea of the ASM is to find a number of subspaces which describe 
the episodes well. This calls for both a clustering mechanism, which assigns 
episodes to subspaces, and a subspace-finding method, which calculates subspace 
parameters based on the assigned episodes. Although many algorithms could in
principle be used, here we only consider the PCA algorithm for finding subspaces
and a k-means-like clustering method we will refer to as the k-subspaces method.
The basic algorithm for the adaptive subspace method is thus:

1. Create a data set by extracting samples from one or more images and make 
episodes Ek by translating, rotating and/or scaling the samples 

2. Assign these episodes randomly to one of the subspaces (Sj, Oj), j = 1, ..., n

3. Re-calculate each subspace, using a PCA on the episodes assigned to that 
subspace to find Sj and setting the origin Oj of the subspace to be the mean 
of these episodes 

4. Assign each episode Ek to the closest subspace, which is that subspace
to which the average sample projection distance is smallest:

$$D(E_k, S_j) = \frac{1}{|E_k|} \sum_{x \in E_k} \|(x - O_j) - \hat{x}\| , \tag{2}$$

where $\hat{x}$ is the projection of (x - Oj) onto Sj

5. While not converged, go to 3 

Note that we use the average distance of all samples as opposed to the minimum 
distance used by Kohonen. This stabilised convergence and did not give very 
different results. Also, in the ASM each subspace has its own origin Oj. 
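
The loop of steps 2 to 5 can be sketched with NumPy as follows. This is a minimal sketch under several assumptions: episodes are stored as one array of shape (episodes, samples per episode, features), PCA is computed through a singular value decomposition, an empty subspace is re-seeded with a random episode, and the iteration cap `max_iter` stands in for a convergence test. None of these choices or names (`fit_subspace`, `avg_projection_distance`, `k_subspaces`) come from the paper.

```python
import numpy as np


def fit_subspace(samples, dim):
    """PCA on the samples: return (origin, orthonormal basis as rows)."""
    origin = samples.mean(axis=0)
    _, _, vt = np.linalg.svd(samples - origin, full_matrices=False)
    return origin, vt[:dim]


def avg_projection_distance(episode, origin, basis):
    """Average distance of the episode's samples to the subspace (origin, basis)."""
    centred = episode - origin
    residual = centred - centred @ basis.T @ basis
    return np.linalg.norm(residual, axis=1).mean()


def k_subspaces(episodes, n_subspaces, dim, max_iter=50, seed=0):
    """Cluster episodes into subspaces; returns (assignment, list of models)."""
    rng = np.random.default_rng(seed)
    n_features = episodes.shape[-1]
    assign = rng.integers(n_subspaces, size=len(episodes))  # step 2
    for _ in range(max_iter):
        models = []
        for j in range(n_subspaces):  # step 3: PCA per subspace
            members = episodes[assign == j]
            if len(members) == 0:  # assumption: re-seed an empty subspace
                members = episodes[rng.integers(len(episodes))][None]
            models.append(fit_subspace(members.reshape(-1, n_features), dim))
        dists = np.array([[avg_projection_distance(e, o, b) for o, b in models]
                          for e in episodes])
        new_assign = dists.argmin(axis=1)  # step 4: closest subspace
        if np.array_equal(new_assign, assign):  # step 5: converged
            break
        assign = new_assign
    return assign, models


# Hypothetical data: 12 episodes of 4 samples with 5 features each.
episodes = np.random.default_rng(1).normal(size=(12, 4, 5))
assign, models = k_subspaces(episodes, n_subspaces=2, dim=2)
print(assign)
```

Like k-means, this procedure only finds a local optimum that depends on the random initial assignment, so several restarts may be worthwhile in practice.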

Episodes are created by taking randomly shifted samples in a certain range 
(translation), by rotating the image over a certain range of angles and taking 
samples (rotation) or by scaling the image over a certain range of scales and 
taking samples (scaling) . For the latter two, the rotation or scale range is divided 
into a number of equally large steps, with a small random offset. 

In the experiments described in this paper, after creation of the episodes, the 
data is pre-processed by subtracting the average grey value from each sample 
and normalising its standard deviation to 1. Note that this means that the origin 
Oj will be zero for all subspaces. 
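
A translation episode with the pre-processing described above might be produced as sketched below. The image representation (a list of rows), the window size, the shift range and the use of the population standard deviation are assumptions of this sketch; rotation and scaling episodes would need an image resampling routine and are not shown.

```python
import random
import statistics


def normalise_sample(sample):
    """Subtract the average grey value and scale to unit standard deviation."""
    mean = statistics.fmean(sample)
    std = statistics.pstdev(sample)
    if std == 0:  # assumption: a flat window is mapped to all zeros
        return [0.0] * len(sample)
    return [(v - mean) / std for v in sample]


def translation_episode(image, row, col, size, max_shift, n_samples, rng):
    """Cut n_samples windows of size x size near (row, col), each randomly shifted.

    The caller must make sure that all shifted windows stay inside the image.
    """
    episode = []
    for _ in range(n_samples):
        r = row + rng.randint(-max_shift, max_shift)
        c = col + rng.randint(-max_shift, max_shift)
        window = [image[i][j] for i in range(r, r + size) for j in range(c, c + size)]
        episode.append(normalise_sample(window))
    return episode


# Hypothetical 10 x 10 image with a simple grey-value ramp.
image = [[10 * i + j for j in range(10)] for i in range(10)]
episode = translation_episode(image, row=4, col=4, size=3, max_shift=2,
                              n_samples=5, rng=random.Random(0))
print(len(episode), len(episode[0]))  # 5 9
```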

4 Application 1: Image Database Retrieval 




An ASM trained on episodes collected from images, or even a collection of images, 
can be seen as a descriptor of that image. This is useful for image database 






applications, in which it is often a problem to define measures such that image 
content can be described in a compact way. A large body of literature exists 
dealing with indexing images based on their texture content - see e.g. Antani et 
al. PP for a review. As they note, it is impossible to define a good set of features 
beforehand for a wide variety of images; therefore, the best approach is to be 
adaptive. 

In our approach, the feature extraction and feature selection stages are rolled
into one and performed automatically. All that remains is to define a way of using
the ASMs to find distances between (classes of) images, say $I_A$ and $I_B$. There
are two possible strategies:

1. Train ASMs A and B on images (or classes of images) $I_A$ and $I_B$ and use
these as a descriptor

2. Train an ASM A on image (or class of images) $I_A$ and use the histogram of
an image $I_B$ mapped onto ASM A as a descriptor (in mapping an image,
each pixel is assigned to a subspace using the window around it)

The second option seems to be preferable, as it defines distances between images
$I_A$ and $I_B$ in terms of the content of both images and the size of the regions of
similar content. However, it requires mapping each new image (or query image)
onto all maps found so far, which can be a computationally intensive task. The
first option is computationally much lighter, but discards information on the size
of the regions responsible for each subspace in the map. For the image database
application, we used the first option. The second option is found to be more
applicable to classification problems. An example is discussed in Sect. 5.



4.1 Comparing ASMs 

For the first approach, a distance measure between ASMs A, with subspaces
$S_i^A$, and B, with subspaces $S_j^B$, can be defined:

$$D(A, B) = \max(D'(A, B), D'(B, A)) \tag{3}$$

$$D'(A, B) = \frac{1}{n} \sum_{i=1}^{n} \min_j D''((S_i^A, O_i^A), (S_j^B, O_j^B)) \tag{4}$$

$$D''((S_i^A, O_i^A), (S_j^B, O_j^B)) = \frac{1}{d} \sum_{k=1}^{d} \|s_k^{A_i} - \hat{s}_k^{A_i}\|^2 \tag{5}$$

Here each d-dimensional subspace S is spanned by basis vectors $s_1, \ldots, s_d$, and
$\hat{s}_k^{A_i}$ is the projection of basis vector k of subspace $S_i^A$ onto subspace $S_j^B$ (after
$O_j^B$ has been subtracted)^. The idea behind this measure is to find, for each

^ Experiments were also performed in which a more principled distance measure bet- 
ween subspaces called the gap [21 was used: D" {Sf , Sf) = \\Pt — Pf\\ 2 , where 
P — S{S^ S )~^ is the projection matrix onto subspace S. However, the compu- 
tational burden of this method was much higher due to the singular value decom- 
position needed for the calculation of the norm, and results were more or less the 



same. 



98 



D. de Ridder et al. 




Fig. 1. Example images from the five sequences, from left to right: news reader, queen, 
cathedral, hut and flag. The regions were used only in the KIDS system. 



subspace in ASM A, the closest subspace in B and average this distance over all 
subspaces in A. The same is done for B and the maximum is taken, like in the 
Haussdorf distance. 
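The distance of equations (3)-(5) can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions that are not taken from the paper: every subspace is given as a matrix with orthonormal columns, all origins are zero, and an ASM is simply a list of such matrices. The function names are hypothetical.

```python
import numpy as np

def subspace_distance(basis_a, basis_b):
    """D'': mean squared residual of the basis vectors of one subspace
    after orthogonal projection onto another (origins assumed zero)."""
    # basis_a, basis_b: arrays of shape (D, d) with orthonormal columns.
    projected = basis_b @ (basis_b.T @ basis_a)  # project each column of basis_a
    residual = basis_a - projected
    return float(np.mean(np.sum(residual ** 2, axis=0)))

def directed_distance(asm_a, asm_b):
    """D'(A, B): for each subspace in A take the closest one in B, then average."""
    return float(np.mean([min(subspace_distance(sa, sb) for sb in asm_b)
                          for sa in asm_a]))

def asm_distance(asm_a, asm_b):
    """D(A, B): Hausdorff-like maximum of the two directed distances."""
    return max(directed_distance(asm_a, asm_b), directed_distance(asm_b, asm_a))
```

Because the directed distance is not symmetric (one map may contain subspaces that have no counterpart in the other), taking the maximum of both directions penalises a mismatch on either side.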

As said before, the problem with this distance measure is that all information
about the size of the regions responsible for the subspaces is discarded. Therefore,
one can weight the distance between a subspace $S_i^A$ and an ASM B with the
relative importance $w_i^A$ of $S_i^A$ in ASM A:

$$D'(A, B) = \sum_{i=1}^{n} w_i^A \min_j D''\bigl((S_i^A, O_i^A), (S_j^B, O_j^B)\bigr) \qquad (6)$$

It was found that

$$w_i^A = \frac{h_i^A \, (1 - e_i^A)}{\sum_{j=1}^{n} h_j^A} \qquad (7)$$

gave good results, where $h_i^A$ is the i-th bin of the histogram of the mapping
of the training image onto ASM A, and $e_i^A$ is the average projection error of the
pixels assigned to subspace $S_i^A$.



4.2 Experiments 

A small image database was created containing images taken randomly from 
the MPEG7 database and stills from an hour-long video sequence of Sky news. 
Besides these images, five sequences of similar images of a news reader, HRH 
the Queen, Guildford cathedral, a hut, and a US flag were added. The goal of 
the experiment was, given an image in a sequence, to find the other images in 
that sequence. Examples of the images in these sequences are given in Fig. 1.
All 200 database images, originally 24-bit colour, were converted to grey values 
and histogram stretched. 

On each individual image, ASMs were trained for various settings of the 
number of subspaces n, subspace dimension d and sample window size w. To 
learn about the influence of the episodes, two experiments were run. In the first, 
for training, 200 episodes were created, each containing 5 samples translated 
randomly in a [-5, 5] pixel range. In the second, episodes contained 15 samples:
5 translated as before, 5 rotated over a [-45, 45] degree range and 5 scaled in a
range of [1.0, 1.5] times the original size. In both cases a round sampling window
was used, with a diameter of w pixels. 
Table 1. Results on the image database retrieval problem. Numbers shown are ranks
of the images sought in the retrieval order.

               Translation invariance ASMs        All-invariance ASMs
               2D subspaces,    4D subspaces,     2D subspaces,     4D subspaces,
Sequence       n = 20, w = 12   n = 8, w = 16     n = 20, w = 8     n = 12, w = 8     KIDS
news reader    1,6,7,18         1,2,3,7           1,2,4,12          1,2,3,4           17,19,20,23,25
queen          4,10,16,20       4,6,16,22         1,5,20,21         4,6,20,36         14,15,17,21
cathedral      55,124,184,198   110,122,179,193   126,192,194,198   110,186,197,198   2,4,5,6
hut            2,5,16,50        1,2,3,40          7,9,12,57         1,2,3,83          1,2,3,5
flag           1,2,4,5,10       11,28,29,44,73    1,2,3,4,13        1,2,3,8,34        24,25,30,35,50
cathedral (2)  12,50,59,92      16,17,25,108      1,22,50,111       4,5,23,97         -

After training, for each sequence r one image was used as a test image.
The distance $D(A^{test}, A_i)$, $i = 1, \ldots, 199$, between the ASM $A^{test}$
trained on this image and all other ASMs was calculated using equations (3)-(7), and
the images were ordered by the distance of their ASM to that of the test image.
Finally, the ranks of the other images in sequence r were noted. Table 1 shows the
results for various settings of parameters.
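The retrieval protocol described here (order all database images by distance to the query, then read off the ranks of the images belonging to the same sequence) can be sketched as follows. The function name and the dictionary-based interface are illustrative assumptions, not the authors' implementation.

```python
def ranks_of_sequence(distances, sequence_ids):
    """Return the 1-based retrieval ranks of the images in sequence_ids.

    distances: dict mapping image id -> distance between that image's ASM
               and the ASM of the query image (the query itself is excluded).
    """
    ordered = sorted(distances, key=distances.get)  # closest image first
    rank = {image_id: r for r, image_id in enumerate(ordered, start=1)}
    return sorted(rank[i] for i in sequence_ids)
```

For example, with distances `{"a": 0.3, "b": 0.1, "c": 0.7, "d": 0.2}` and a sequence consisting of images `a` and `c`, the function returns `[3, 4]`. A perfect retrieval of four sequence members would give the ranks 1, 2, 3, 4.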

As a comparison, the same queries were performed using a state-of-the-art 
system called KIDS 0. This system can also handle colour features, but for the 
comparison only texture features (DCT, Gabor and wavelet) were used. KIDS 
requires the user to specify regions to search for in the database; the regions 
used are shown in Fig. 1. A threshold of 0.5 was used, which gave optimal overall
results (for more information, see |B|). The results are also given in Table 1.

The table shows that the ASM method gives promising results, even compared
to KIDS. For most of the query images the other images in the sequence
are ranked high. Also, training on episodes with translated, rotated and scaled
samples pays off for most sequences. Notable exceptions are the cathedral and
hut sequences, which are the only two sequences consisting of high-resolution
images, with areas of fine texture. The ASMs code this texture quite precisely,
so that different views of the cathedral give quite different ASMs. If the
cathedral experiment is repeated on reduced versions of the images (by 50% in
both the x and y direction), losing the high-frequency textures, results are much
better; see the cathedral (2) row in Table 1. Interestingly, the two problem
queries are where KIDS performs best, indicating that the two techniques might be
complementary.

Of course, it is possible to use a large number of settings for the number 
n of subspaces to use and the sampling window size w. Unfortunately, space 
does not permit giving these results. For most other settings the overall results 
were worse, although on individual sequences they would sometimes be slightly 
better. 

5 Application 2: Object Image Classification 

The ASM can also be used as an adaptive description of a class of images. The
histograms of an image mapped onto a class ASM can then be used for
classification. This idea has been explored earlier by e.g. Idris and Panchanathan,
who use vector quantisation on an entire set of images and use histograms of
images mapped onto the code book vectors as descriptors. Another example is
the work of Lampinen and Oja, in which clusters are found in a space spanned
by Gabor filters at different resolutions, and a supervised layer is applied
for classification.

5.1 Experiments 

For our experiments, a small data set of images of 6 different classes (book cases, 
chess pieces, mugs, workstations, a tea flask and some bridges) was created. All 
images were acquired using a Sony digital camera, re-sized to 320 x 240 pixels, 
converted to grey values and histogram stretched. Per class, 9 training images 
and 6 test images were used. The intra-class variation between objects was quite 
high, since objects were photographed at different distances and against different 
backgrounds. Also, the inter-class distance was kept low for the object images 
(chess pieces, mugs and the tea flask) by taking photographs of each of these 
against three different backgrounds. For some examples, see Fig. 2.

An ASM $A_c$ with n subspaces of d dimensions each was trained for each class
c = 1, ..., 6, on 900 episodes taken from the 9 training images using a round
window with a diameter of w pixels. Episodes contained 5 samples translated
randomly in a [-5, 5] pixel range. After training, for each class c all training
images were mapped onto their ASM $A_c$, and an n-bin class histogram $h^{train}$
was created by counting the relative number of pixels assigned to each subspace
(i.e. the number of pixels assigned to a subspace divided by the total number of
pixels in the image). The mean $\mu_c$ and covariance matrix $C_c$ of these histograms
were then used as class descriptions.

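Each test image (class k = 1, ..., 6; image j = 1, ..., 6) was mapped
onto each of the class ASMs $A_c$. The histograms $h^{test}_{kj,c}$ of these mappings
were then used to calculate the Mahalanobis distance to each of the classes:

$$D_M(h^{test}_{kj,c}, A_c) = (h^{test}_{kj,c} - \mu_c)^T \, C_c^{-1} \, (h^{test}_{kj,c} - \mu_c) . \qquad (8)$$

Due to the small number of training images, some regularisation was necessary: a
small multiple of the identity matrix was added to the covariance matrix,
$C_c \leftarrow C_c + \epsilon I$, with $\epsilon$ a small power of ten. Each test image
was then assigned to that class $\hat{c}$ which gave the lowest Mahalanobis distance:

$$\hat{c} = \arg\min_c D_M(h^{test}_{kj,c}, A_c) . \qquad (9)$$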
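A minimal NumPy sketch of this classification rule is given below. It assumes that the mapping histograms have already been computed; the function names are hypothetical, and the regularisation constant `eps` is an arbitrary placeholder, since the value used in the experiments cannot be read from the text.

```python
import numpy as np

def fit_class(histograms, eps=1e-3):
    """Mean and regularised inverse covariance of the training histograms
    of one class. eps is an assumed value, not the one used in the paper."""
    h = np.asarray(histograms, dtype=float)  # shape (n_images, n_bins)
    mean = h.mean(axis=0)
    cov = np.cov(h, rowvar=False) + eps * np.eye(h.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(hist, mean, inv_cov):
    """Squared Mahalanobis distance of a histogram to a class description."""
    diff = np.asarray(hist, dtype=float) - mean
    return float(diff @ inv_cov @ diff)

def classify(test_hists, models):
    """test_hists[c] is the histogram of the test image mapped onto class ASM c;
    models[c] is the (mean, inverse covariance) pair returned by fit_class."""
    dists = [mahalanobis(h, m, ic) for h, (m, ic) in zip(test_hists, models)]
    return int(np.argmin(dists))
```

Note that each class has its own ASM, so a test image yields a different histogram per class; this is why `classify` takes one histogram per class rather than a single feature vector. Without the added `eps * I` term the covariance estimated from only nine histograms may be singular.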

The results of these experiments, performed for a range of n = 8, 12, 16 or
20 subspaces, subspace dimensions d = 2 or 4, and a sample window size of w
= 8 or 16, are shown in Fig. 3. The best result obtained is a test error of 11%
(4 out of 36 images classified incorrectly), which is quite reasonable given the
difficulty of the problem. The window size does not seem to be too important, as
for both w = 8 and w = 16 the optimum is reached. There is an optimal number
of subspaces, but again the exact choice seems not too critical. Furthermore,
2-dimensional subspaces seem to be sufficient; 4-dimensional subspaces lack the
distinctiveness needed to describe class-specific image information well.
Fig. 2. Some training images in the object image data set: book cases, workstations,
bridges, chess pieces (3x), mugs (3x) and tea flasks (3x).

[Plots: "Class map, 2D subspaces" and "Class map, 4D subspaces"; horizontal axes:
number of subspaces (n).]

Fig. 3. Test error for various settings of the number of subspaces n, the sample window
size w and subspace dimensionalities d = 2 (left) and d = 4 (right).

Fig. 4. The four incorrectly classified images.

Fig. 5. Feature ranking for an image containing a chess piece (left), a mug (middle)
and a tea flask (right). The numbers indicate the ranks for some regions.

It is interesting to look at the misclassifications of the best performing ASMs.
Mostly the same test images are misclassified for a variety of settings: three
images in the mugs class and one workstation image. These are shown in Fig. 4.
The three images of mugs are the only three images (in both training and test
set) in which the handle of the mug is not visible. These images are classified as
chess pieces. The workstation image is the only one in which a row of books is
also visible, and is labelled as a book case. In all these cases, the difference in
Mahalanobis distance between the true class and the class the image was labelled
as was small.

5.2 Feature Extraction 

To investigate what features are found by the ASMs, and whether the ASMs
do not merely describe the background, the subspaces were ranked as follows.
Only the classes of images containing chess pieces, mugs and tea flasks were
considered, as these shared the same backgrounds. Each subspace $S_i^c$,
$i = 1, \ldots, n$, representing class c was ranked by calculating the average
Mahalanobis distance of the training images of the other classes to that class c,
using only mapping histogram bin $h_i$. This distance was then used to label the
image: the brighter the colour, the better the feature. Figure 5 shows three examples
of these rankings. The background is mapped on just a few subspaces, more or less
randomly, but some informative features have been found for each of the classes:
curved edges and uniform regions for the chess pieces, the curved handles of the
mugs and curves and large uniform regions for the tea flask. Of course, it is the
Mahalanobis distance that makes use of these features.

6 Conclusions 

An adaptive image description method was presented, which uses subspaces to 
describe translated, rotated and/or scaled versions of patches extracted from 
images. The resulting description (ASM) can be used to segment images PI, but 
in this paper the focus was on using it for classification purposes. 

First, given the distance measure between ASMs introduced in Sect. 4.1, ASMs
can be used to compare images. This method was shown to be applicable to image 
database retrieval problems. Although the database used was small, the results 
were quite good compared to a state-of-the-art system, given the fact that only 
texture and edge information is used (since the average grey value is removed 
from the samples). 

Second, mappings of images on ASMs can be used. This was demonstrated on 
an object recognition problem. Histograms of images mapped onto class-specific 
ASMs indicate the size of regions present in an image containing a specific texture 
or structure. These histograms can then be used to classify images. Experiments 
show that significant features are found and recognised. 

In the experiments performed some important settings, such as the number
of subspaces to use within each ASM, subspace dimensionality and sampling
window size, were optimised by hand. For easier applicability, it would
be interesting to investigate optimising these automatically, or perhaps to use a
multi-scale approach. Another possible extension is the combination of the
technique with other feature extraction mechanisms, the most important of which
would be the use of colour information. Finally, we intend to investigate other
subspace-finding algorithms, such as independent component analysis (ICA) or
even non-linear methods such as multi-dimensional scaling (MDS), which might
be more applicable to some problems.

Abstract. This paper is concerned with the detection of dim targets in cluttered
image sequences. It is an extension of our previous work Q in which we viewed 
target detection as an outlier detection problem. In that work the background 
was modelled by a uni-modal Gaussian. In this paper the background is described by
a Gaussian mixture model in which the number of components is selected
automatically. As an outlier does not automatically imply a target, a final
stage has been added in which all points below a set density function value are 
passed to a support vector classifier to be identified as a target or background. 
The system compares favourably with a baseline technique [12].

Keywords: Automatic Target Recognition, Mixture Modelling, Support Vector
Machines, Outlier Detection.



1 Introduction 

Automatic Target Recognition (ATR) is concerned with the detection, tracking and re- 
cognition of small targets using input data obtained from a multitude of sensor types 
such as forward looking infrared (FLIR), synthetic aperture radar (SAR) and laser radar 
(LADAR). Applications of ATR are numerous and include the assessment of battlefield 
situations, monitoring of possible targets over land, sea and air and the re-evaluation of 
target position during unmanned missiles weapon firing. 

An ideal system will exhibit the properties of a low false positive rate (detection of 
a non-target as a target), whilst obtaining a high true positive rate (the detection of a 
true target). This performance should be invariant to the following parameters: sensor 
noise; time of day; weather types; target size/aspect and background scenery. It should 
be flexible such that it has the ability to detect previously unseen targets and be able to 
retrain itself if necessary. It is unlikely that one single system will cope well with all these 
possible scenarios Q] . The many challenges produced by ATR have been previously well 
documented in O], Q and Q. 

In this paper an adaptive ATR system is proposed which is suitable for scenes with
strong clutter which is spatially and temporally highly structured, such as sea glint and
atmospheric scintillation. In the bootstrap phase a statistical Gaussian mixture model of
the background is built by using a set of texture filters. In operation, the same features
are computed for each new pixel arriving at the sensor input. If the probability density
value of this pixel's feature vector falls below a set threshold, the pixel is considered
a potential target. A low probability density value does not necessarily imply a target,
e.g. it could be sea glint. For this reason a final stage has been added to the system in
which a support vector machine is used to classify all the outliers as a target or as
clutter. This is consistent with realistic operational scenarios as the target objects
required for training can easily be inserted in images synthetically.

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 104-117, 2000.

Another novelty of this work is the technique applied to obtain a suitable set of filters
which ensures that the background/target separation is maximised during training. In our
previous work we demonstrated that the use of a set of adaptive texture filters to model
each background outperformed the more traditional wavelet-based feature extractor. This
set of filters was designed using Principal Component Analysis on randomly sampled
image patches taken from a training image. This ensured that these filters had a mean
response when presented with a similar looking texture. If an object with a different
texture, such as a target, is presented to the filter, the resulting response should be
non-mean, making its detection as an outlier easier. In this paper the filter design
methodology is enhanced further to take into account the temporal dimension of the image
data, i.e. PCA is used to build 3-dimensional texture filters. Combining image data from
different frames prior to detection is commonly known as the track-before-detect (TBD)
approach.

This method is compared to another TBD technique [12] in which targets are
distinguished from the clutter by the analysis of the joint statistics of simple events
such as glint flashes and regions of persistent brightness.

The rest of this paper is organised as follows: in the next section the DERA ATR
system is briefly reviewed before our target detection algorithm is detailed in full. In
section 4 experiments on two image sequences are performed. Finally, some conclusions
are drawn.

2 Multivariate Conditional Probability 

In [12] a target recognition approach was proposed in which the multivariate statistics
of space-time structure were used to characterise spatially and temporally highly
structured clutter, such as sea glint and atmospheric scintillation. Targets were then
recognised as unusual events.

Two three-dimensional filters were manually chosen and consisted of a constant-
intensity blob filter and a filter tuned to sea glint flashes. A third feature was also used
which was simply the vertical image co-ordinate. Dim targets were distinguished from
the clutter by using the joint statistics of these three variables. A low joint probability
identified a possible target.

3 Adaptive Texture Representation 

We also view the target detection as an outlier detection problem. That is, anything that 
does not normally occur in the background is viewed as a potential target. Our target 
detection algorithm has three basic steps: 

Model Generation: The background is described using a Gaussian mixture model.

Model Optimisation: The model and model size are optimised using training data (if
available).

Target Detection: Outliers are found by deciding, per pixel, whether it is consistent with
the model.

The background is represented by computing a feature vector, $\mathbf{f} = [y_0, y_1, \ldots, y_n]$,
for every pixel in the training image. Each $y_k$ represents a measurement obtained by the
k-th filter. The distribution of these feature vectors is modelled by a mixture of Gaussians.
Such a mixture model is defined by equation (1):

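$$p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x}|j)\, P(j) \qquad (1)$$

The coefficients $P(j)$ are called the mixing parameters and are chosen such that

$$\sum_{j=1}^{M} P(j) = 1 \quad \text{and} \quad 0 \le P(j) \le 1 \qquad (2)$$

Also note that the component functions satisfy the axiomatic properties of probability
density functions

$$\int p(\mathbf{x}|j)\, d\mathbf{x} = 1 \qquad (3)$$

In this work we used the normal distribution with a diagonal covariance matrix for
the individual component density functions

$$p(\mathbf{x}|j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right) \qquad (4)$$

where $\boldsymbol{\mu}_j$ is the mean of component j, $\sigma_j$ is its standard deviation
and d is the dimensionality of $\mathbf{x}$. The optimal values of the parameters $P(j)$,
$\boldsymbol{\mu}_j$ and $\sigma_j$ are estimated using the Expectation Maximisation
algorithm [4].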

The EM algorithm requires, as an input, the number of components to be used for the
data distribution modelling. This number is selected automatically using the model
validation method proposed in [10]. This iterative algorithm systematically increases the
model complexity until a model validation test is passed. This model selection strategy
prevents both overfitting and underfitting.

To detect possible targets in test frames the same set of n features is generated for
every pixel in the image. Each feature vector, $\mathbf{f}_{test}$, is tested in turn to see
whether it belongs to the same distribution as the background or is an outlier (i.e. a
possible target). This is done by computing the density function value for that pixel,
based on the mixture model. If this value falls below a threshold, the pixel is considered
an outlier and treated as a possible target. This threshold can be automatically determined
from the training data.
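The outlier test can be illustrated with a small sketch that evaluates the mixture density of equation (1), with the spherical Gaussian components of equation (4), and compares it with a threshold. The parameters are assumed to have been estimated already (e.g. by EM); the function names and the way the threshold is supplied are assumptions made for illustration only.

```python
import math

def component_density(x, mean, sigma):
    """Spherical Gaussian component density, as in equation (4)."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm

def mixture_density(x, priors, means, sigmas):
    """Mixture density of equation (1): sum over j of p(x|j) P(j)."""
    return sum(p * component_density(x, m, s)
               for p, m, s in zip(priors, means, sigmas))

def is_outlier(x, priors, means, sigmas, threshold):
    """A pixel is a candidate target when its density falls below the threshold."""
    return mixture_density(x, priors, means, sigmas) < threshold
```

With a single unit-variance component centred at the origin, a feature vector at the origin has density 1/(2π) in two dimensions and is not flagged for a threshold of 0.001, whereas a vector at (5, 5) is.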

There is also a problem of knowing which features to use to ensure the targets and 
background vectors are well separated in the feature space. For this reason a feature 
selection stage was added which selects features using the sequential forward selection 
algorithm 0. 
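Sequential forward selection is a greedy search: starting from an empty set, it repeatedly adds the single feature that gives the best value of some separability criterion. The sketch below is generic; the criterion `score` is a hypothetical callable supplied by the user, since the text does not state which criterion was used.

```python
def sequential_forward_selection(n_features, score, k):
    """Greedily select k feature indices.

    score: callable taking a list of feature indices and returning a number
           that is larger when background and targets are better separated.
    """
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Try each remaining feature and keep the one that helps most.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The search evaluates far fewer subsets than an exhaustive search, at the cost of never revisiting a feature once it has been added.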



3.1 Filter Design 

The background regions of an image are described adaptively using Principal Compo- 
nent Analysis (PCA, also known as the Karhunen-Loeve transform). The representation 
adopted is an extension of an earlier method identified as the most promising in |20 
in which we compared a PCA method against a standard Wavelet-based method and 
a method based on Independent Component Analysis. In our previous paper the filter 
design was two-dimensional. In this paper we incorporate the temporal dimension into 
the filter design. 

Principal Component Analysis finds a linear base to describe the dataset. It finds
axes which retain the maximum amount of variance in the data. To construct a PCA base,
firstly N random rectangles of size r x c are taken from a set of training images. These
rectangles are then packed into an (r x c)-dimensional vector $\mathbf{x}_i$, usually in a
row-by-row fashion. This results in a data set X containing N samples. Assuming that the
global mean of the vectors in X is zero, the principal components are the eigenvectors of
the covariance matrix $XX^T$. These are the columns of the matrix E, satisfying

$$E D E^{-1} = X X^T \qquad (5)$$

where D is a diagonal matrix containing the eigenvalues corresponding to the
eigenvectors in E. The set of 2D filters is then generated by unpacking each row of $E^T$
into a filter of size r x c.

The design of 3D filters, used in all the following experiments, follows the same
process as for the 2D design; however, instead of extracting image rectangles from a
single image, the rectangles are taken from d consecutive images. This data is then
packed to form a vector of (r x c x d) dimensions. Typically, d is set to 3.
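The filter design can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the sequence is given as an array of shape (frames, height, width), patches are sampled uniformly at random, and the mean patch is subtracted explicitly to satisfy the zero-mean assumption. The function name and its parameters are hypothetical. Setting d = 1 gives the 2D case.

```python
import numpy as np

def pca_filters(frames, n_patches, r, c, d, seed=0):
    """Sample n_patches blocks of r x c pixels from d consecutive frames and
    return the principal components reshaped into filters of shape (d, r, c),
    together with their eigenvalues (largest first)."""
    frames = np.asarray(frames, dtype=float)      # shape (T, H, W)
    rng = np.random.default_rng(seed)
    t_max, h, w = frames.shape
    samples = np.empty((n_patches, d * r * c))
    for i in range(n_patches):
        t = rng.integers(0, t_max - d + 1)        # upper bound is exclusive
        y = rng.integers(0, h - r + 1)
        x = rng.integers(0, w - c + 1)
        samples[i] = frames[t:t + d, y:y + r, x:x + c].ravel()
    samples -= samples.mean(axis=0)               # enforce zero mean
    cov = samples.T @ samples / n_patches
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # largest variance first
    filters = eigvecs[:, order].T.reshape(-1, d, r, c)
    return filters, eigvals[order]
```

The response of a filter at a pixel is then the inner product between the filter and the block of the same shape centred on that pixel; the leading filters capture most of the variance of the background texture.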



4 Experiments 



The proposed target detection technique has been applied to several sequences made
available by DERA Farnborough and compared to the results obtained on the same
sequences using the multivariate conditional probability (MCP) method described in
[12]. Typical results are shown in this section on a simulated sequence, SEASIM, and
on a real sequence, AM.

4.1 On SEASIM

This sequence contains about twenty frames which have been artificially generated using
a standard ray-tracing package. It represents the scenario of a sensor attached to a ship
looking out over the ocean. Figure 1(a) shows the first frame of this sequence. Five
targets have been inserted into this sequence; their locations are given by the ground
truth image of figure 1(b). These targets are very small (typically one pixel) and represent
missiles moving towards the observer. The intensity of these targets is lower than the
maximum intensity of the image and, as the targets are moving slowly, their pixel intensity
will vary in time due to aliasing effects. A human observer will find it extremely difficult
to identify all targets in this sequence. The two methods of target detection were then
applied to this sequence.



Table 1. Probability ranking of real target

Method      Target Position
Reference   [1, 2, 5, 7]
Proposed    [1, 2, 3, 4, 5]

Fig. 1. Sequence SEASIM. (a) First image. (b) Enhanced ground truth (the original
size of each of the objects is 1 pixel).



The top ten most likely targets using multivariate conditional probability are shown
in figure 2(a). The results obtained using the 3D-PCA and mixture modelling are shown
in figure 2(b). The positions of the targets, for both methods, are shown in table 1. As
one can see, for MCP only four of the five targets have been recognised and two false
positives have been identified in the top five. Using the proposed adaptive method, all
five targets have been selected as the five most likely.




(a) Multivariate conditional probability (b) 3D-PCA and mixture modelling 



Fig. 2. SEASIM: Top 10 detections using both methods 



4.2 The AM sequence 

A real infra-red sequence of just 8 frames was acquired. An artificial target was then
placed in the sea area. The first frame of this sequence along with the ground truth image
is shown in figure 3. Again, manual identification of this target is extremely difficult.

The top ten most likely targets detected using the multivariate conditional probability
method are shown in figure 4(a). The results obtained using the 3D-PCA and mixture
modelling are shown in figure 4(b). For this sequence the MCP method has outperformed
the proposed adaptive method. The single target is the most likely target identified by
the MCP method, but it is not among the top 10 most likely candidates of our approach,
which labels the real target as the 27th most likely.



5 Support Vector Machines 

A target is identified if its density value is below a user-set threshold. The system is
very sensitive to this choice of threshold. If the threshold is set too low, targets are
missed; if it is set too high, many false positives are found. Also, the targets may only
occupy a corner of the low-density feature space, yet all points in the low-density
feature space are being identified as targets. This would explain the poor results for the
AM sequence. The real target has been identified as an outlier, but there are many other
outliers as well. To alleviate this problem a final stage has been added to the target
recognition system which involves passing all the points below the threshold to a support
vector classifier. Most of these points will be false positives but still lie on the edge
of the distribution. Using a classifier should eliminate some of these points.

Fig. 3. Sequence AM. (a) First image. (b) Enhanced ground truth.

Fig. 4. Top 10 detections on sequence AM. (a) MCP. (b) Proposed.

Support vector machines have the major advantage that no density values are
estimated. The classifier is designed on the principle of finding a boundary that optimally
divides the two classes. The SVM boundary leaves the largest margin between the vectors
of the classes. This makes SVMs highly insensitive to the curse of dimensionality, so
they do not require the large amounts of training data usually needed to achieve good
general classification. Figure 5 demonstrates a typical decision surface generated
for 2-dimensional training data.




(a) The best two selected features 



Fig. 5. Example of decision surface formed by SVM classifier. 



Labelled data is required to train the SVM. Typically the training target data set
only contains a few vectors, so all of these must be used. However, there are typically
tens of thousands of known background vectors. It is not possible to use them all to train
the SVM, as the memory space required by the algorithm is quadratic in the number of
training points. A representative set of points needs to be found. We have found that
selecting random points on the edge of the background distribution (i.e. those with
low density values) seems to give the best results. This corresponds to classifier boosting
as advocated by Freund and Schapire.
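One way to realise this selection of background training points is sketched below: the background vectors are sorted by their mixture density, and a random subset is drawn from the lowest-density part. The fraction that counts as the "edge" of the distribution and the number of points drawn are hypothetical parameters; the text does not specify how they were chosen.

```python
import random

def select_background_subset(vectors, densities, edge_fraction, n_select, seed=0):
    """Randomly pick n_select vectors among the lowest-density fraction
    of the background vectors."""
    order = sorted(range(len(vectors)), key=lambda i: densities[i])
    n_edge = max(n_select, int(len(vectors) * edge_fraction))
    edge = order[:n_edge]                      # indices of low-density points
    rng = random.Random(seed)
    chosen = rng.sample(edge, min(n_select, len(edge)))
    return [vectors[i] for i in chosen]
```

The selected background vectors, together with all available target vectors, would then form the training set of the classifier, which keeps the quadratic memory cost manageable.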

A trained SVM was applied to the outliers obtained in the AM sequence. In this case
the SVM only accepted five points as belonging to the target class; all the other outliers
were classified as belonging to the background class. These five points were then ranked
as a function of their distance from the decision boundary formed by the SVM. The
points along with their corresponding ranks are shown in figure 6. As one can see, the
real target has been labelled as the most probable target in the scene.




(a) SVM 



Fig. 6. Only 5 detections after using support vector classifier 



6 Conclusion 

In this paper we have demonstrated a system for detecting dim targets in a cluttered
background, i.e. sea glint. This ATR system has been favourably compared to another
leading edge technique used as a baseline in our study. Several improvements have
also been made to our original system, namely:

Clutter Model A Gaussian mixture is used to model the feature distribution of the 
background. This allows for a more flexible representation of the clutter. 

Temporal Data The temporal nature of the data is being incorporated into the design 
of the filters, making more robust filters. 

SVM The application of support vector machines to aid the classification of the most
outlying data points has significantly improved performance. (It can also be argued
that a similar performance level increase would have been observed if the outliers
found by the MCP method were passed through an equivalent SVM.)

An advantage of our approach is that the system is flexible, which complies with
realistic operational scenarios. We assume a model is available of what the sensor is
looking for, i.e. the target. As a model of the current sensor input has been computed,
our system can be optimally tuned to distinguish between this target and the current
background. If the background happens to change or the target model is modified, the
system can be adapted to this new environment.

In fact, our method and the MCP method can be seen as complementary. Both
are looking for the same targets but each uses a different technique to obtain the posterior 
target probabilities. In theory, it should be possible to achieve a more robust ATR system 
by the combination of these probabilities. 

Abstract. An efficient recursive algorithm for realistic colour texture
synthesis is proposed. The algorithm starts with spectral factorization of
an input colour texture image using the Karhunen-Loeve decorrelation.
Single orthogonal monospectral components are further decomposed into
a multi-resolution grid and the data at each resolution are independently
modelled by their dedicated simultaneous causal autoregressive random field
model (CAR). We estimate an optimal contextual neighbourhood and
parameters for each CAR submodel. Finally, single synthesized monospectral
texture pyramids are collapsed into the fine resolution images and
using the inverse Karhunen-Loeve transformation we obtain the required
colour texture. The benefit of the multigrid approach is the replacement
of a large neighbourhood CAR model with a set of several simpler CAR
models which are easy to synthesize, and a wider application area of these
multigrid models, which are capable of reproducing realistic textures for
enhancing realism in texture application areas.



1 Introduction 

Virtual reality systems require object surfaces covered with realistic nature-like 
colour textures to enhance realism in virtual scenes. These textures can be either 
digitised natural textures or textures synthesized from an appropriate mathematical 
model. Digitised solid 3D textures are far less convenient, since they 
involve the 2D digitisation of a large number of cross-sectioned slices through 
some material. Synthetic textures are more flexible than digitised textures, in 
that they can be designed to have certain desirable properties or 
meet certain constraints; for example, a synthetic texture can be made smoothly 
periodic, so that it can be used to fill an infinite texture space without visible 
discontinuities. While a digitised texture must be stored in tabular form and 
evaluated by table lookup, a synthetic texture may be evaluated directly in 
procedural form. Other applications of texture models include image and video 
compression, image restoration, image classification and many others. 

There are several texture modelling approaches published [?] and some 
survey articles are also available [?]. Our previous paper [?] introduced a fast 
multiresolution Markov random field based method. Although this method avoids 
the time-consuming Markov chain Monte Carlo simulation so typical for applications 
of Markov models, it requires several approximations. Simultaneous causal 
autoregressive random fields are appropriate models for texture synthesis not 
only because they do not suffer from some problems of the alternative options (see 
[?] for details) but also because they are easy to analyze as well as to synthesize and, 
last but not least, they are still flexible enough to imitate a large set of natural 
and artificial textures. 

Multiple resolution decompositions (MRD) such as Gaussian/Laplacian pyramids, 
wavelet pyramids or subband pyramids present an efficient method 
for compressing spatial information. The hierarchy of resolutions provides a 
transition between pixel-level features and region or global features and hence 
makes it possible to model a large variety of textures. Unfortunately, autoregressive 
random fields, like the majority of other Markovian types of random field 
models [5], are not invariant to multiple resolution decomposition, even 
for a simple MRD like subsampling: the lower-resolution images generally lose 
their autoregressive property and become ARMA random fields instead. To avoid 
computationally demanding approximations of an ARMA multigrid random field 
by infinite-order (i.e., high-order in practice) autoregressive random fields, we 
analyze each resolution component independently. 



2 Texture Model 

Modelling general colour texture images requires three-dimensional models. If a 
3D data space can be factorized, then these data can be modelled using a set of 
lower-dimensional 2D random field models; otherwise it is necessary to use some 
3D random field model. Although full 3D models allow unrestricted spatial-spectral 
correlation modelling, their main drawback is the large number of parameters 
to be estimated and, in the case of Markov models (MRF), also the necessity 
to estimate all these parameters simultaneously. The factorization alternative is 
attractive because it allows using simpler 2D data models with fewer parameters 
(one third in the three-spectral case of colour textures). Unfortunately, a real data 
space can be decorrelated only approximately, hence the independent spectral 
component modelling approach suffers from some loss of image information. 

Spectral factorization using the Karhunen-Loeve expansion transforms the original 
centered data space $\tilde{Y}$ defined on the rectangular $M \times N$ finite lattice $I$ 
into a new data space $\bar{Y}$ with K-L coordinate axes. The new basis vectors are 
the eigenvectors of the second-order statistical moments matrix 

$$\Phi = E\{\tilde{Y}_r \tilde{Y}_r^T\} \qquad (1)$$

where the multiindex $r$ has two components, $r = \{r_1, r_2\}$; the first component is 
the row index and the second one the column index. The projection of the random 
vector $\tilde{Y}_r$ onto the K-L coordinate system uses the transformation matrix 

$$T = [u_1^T, u_2^T, u_3^T]^T \qquad (2)$$

whose rows $u_j$ are eigenvectors of the matrix $\Phi$: 

$$\bar{Y}_r = T \tilde{Y}_r \qquad (3)$$

Components of the transformed vector $\bar{Y}_r$ are mutually uncorrelated. 
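The decorrelation step (1)-(3) can be illustrated with a minimal sketch. It assumes the texture is given as a flat list of 3-component colour vectors; the helper names (`second_moment`, `jacobi_eigenvectors`, `kl_transform`) are hypothetical, and the cyclic Jacobi iteration is only one convenient way of obtaining the eigenvectors of a small symmetric matrix, not necessarily the one used by the authors.

```python
import math
import random


def second_moment(pixels):
    """Centre the data and return (mean, Phi) with Phi = E{Y_r Y_r^T}."""
    n, d = len(pixels), len(pixels[0])
    mean = [sum(p[c] for p in pixels) / n for c in range(d)]
    phi = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in pixels) / n
            for b in range(d)] for a in range(d)]
    return mean, phi


def jacobi_eigenvectors(sym, sweeps=50, tol=1e-20):
    """Eigenvectors of a small symmetric matrix by cyclic Jacobi rotations.

    Returns a list of rows; each row is one (unit-length) eigenvector.
    """
    d = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(d)] for i in range(d)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
        if off < tol:
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(d):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(d):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(d):  # V <- V J (eigenvectors are columns of V)
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [[v[k][j] for k in range(d)] for j in range(d)]


def kl_transform(pixels):
    """Return (T, transformed pixels); the rows of T are eigenvectors of Phi."""
    mean, phi = second_moment(pixels)
    T = jacobi_eigenvectors(phi)
    out = [[sum(T[j][c] * (p[c] - mean[c]) for c in range(len(mean)))
            for j in range(len(T))] for p in pixels]
    return T, out


# Demo on hypothetical, strongly correlated three-channel samples.
random.seed(0)
samples = []
for _ in range(300):
    g = random.gauss(0.0, 1.0)
    samples.append([g + 0.1 * random.gauss(0.0, 1.0),
                    0.8 * g + 0.2 * random.gauss(0.0, 1.0),
                    -0.5 * g + 0.3 * random.gauss(0.0, 1.0)])
T, decorrelated = kl_transform(samples)
_, phi_after = second_moment(decorrelated)
```

In this sketch `phi_after` is diagonal up to rounding error, which is the sample counterpart of the statement that the transformed components are mutually uncorrelated.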

Texture modelling does not require computationally demanding MRD approximations 
(e.g. [?]) because it does not need to propagate information between 
different data resolution levels. It is sufficient to analyze and subsequently 
generate single spatial frequency bands without assuming knowledge of some 
global multigrid random field model. We assume the colour texture is factorized into 
orthogonal monospectral components [?]. These components are further decomposed 
into a multi-resolution grid and the data at each resolution are independently 
modelled by a dedicated CAR model. Each one generates a single spatial frequency 
band of the texture. An analysed texture is decomposed into multiple resolution 
factors using the Laplacian pyramid and the intermediary Gaussian pyramid. 
The Gaussian pyramid $\ddot{Y}^{(k)}$ is a sequence of images in which each one is a 
low-pass down-sampled version of its predecessor, where the weighting function (FIR 
generating kernel) $w$ is chosen subject to the following constraints: 

$$w_s = w_{s_1} w_{s_2}, \qquad \sum_i w_i = 1, \qquad w_i = w_{-i}, \qquad w_0 = 2 w_1 \quad (l = 1).$$

The solution of the above constraints for the reduction factor 3 ($2l + 1$) is $w_0 = 0.5$, 
$w_1 = 0.25$, and the FIR equation is now 

(the FIR equation is missing in the source)   (4) 

The Gaussian pyramid for a reduction factor $n$ is 

$$\ddot{Y}^{(k)} = \downarrow^n (\ddot{Y}^{(k-1)} \otimes w), \qquad k = 1, 2, \ldots, \qquad (5)$$

where 

$$\ddot{Y}^{(0)} = \bar{Y},$$

$\downarrow^n$ denotes down-sampling with reduction factor $n$ and $\otimes$ is the convolution 
operation. 
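The Gaussian pyramid recursion (5) can be sketched as follows, using the kernel weights 0.5 and 0.25 quoted above in a separable filter. The down-sampling factor n = 2, the replication of edge pixels at the image border, and the function names are assumptions made for this illustration only.

```python
KERNEL = (0.25, 0.5, 0.25)  # weights w_1, w_0, w_1 of the separable kernel


def _clamp(i, n):
    """Replicate edge pixels by clamping the index to [0, n - 1]."""
    return min(max(i, 0), n - 1)


def smooth(image):
    """Separable 3x3 low-pass filter (rows first, then columns)."""
    rows, cols = len(image), len(image[0])
    horiz = [[sum(w * image[r][_clamp(c + k - 1, cols)]
                  for k, w in enumerate(KERNEL))
              for c in range(cols)] for r in range(rows)]
    return [[sum(w * horiz[_clamp(r + k - 1, rows)][c]
                 for k, w in enumerate(KERNEL))
             for c in range(cols)] for r in range(rows)]


def gaussian_pyramid(image, levels, n=2):
    """Level 0 is the input; level k is the smoothed level k-1 down-sampled by n."""
    pyramid = [image]
    for _ in range(levels):
        low = smooth(pyramid[-1])
        pyramid.append([row[::n] for row in low[::n]])
    return pyramid


# Demo: a constant image must stay constant because the weights sum to one.
flat = [[5.0] * 8 for _ in range(8)]
levels = gaussian_pyramid(flat, levels=2)
```

Because the kernel weights sum to one, a constant image is reproduced unchanged at every level, which gives a quick sanity check of the normalisation constraint.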

The Laplacian pyramid $\dot{Y}^{(k)}$ contains band-pass components and provides a good 
approximation to the Laplacian of the Gaussian kernel. It can be constructed by 
differencing single Gaussian pyramid layers $\ddot{Y}^{(k)}$: 

$$\dot{Y}^{(k)} = \ddot{Y}^{(k)} - \uparrow^n (\ddot{Y}^{(k+1)}), \qquad k = 0, 1, \ldots, \qquad (6)$$

where $\uparrow^n$ is the up-sampling with an expanding factor $n$. 

Single orthogonal monospectral components are thus decomposed into a multi- 
resolution grid and each resolution data are independently modeled by their 
dedicated independent Gaussian noise driven autoregressive random field model 
(CAR) as follows. 

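The causal autoregressive random field (CAR) is a family of random variables 
with a joint probability density on the set of all possible realisations $Y$ of the 
$M \times N$ lattice $I$, subject to the following condition: 

$$p(Y \mid \gamma, \sigma^{-2}) = (2\pi)^{-\frac{MN-1}{2}} \, |\sigma^{-2}|^{\frac{MN-1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left\{ \sigma^{-2} \begin{pmatrix} -I \\ \gamma^T \end{pmatrix}^T \tilde{V}_{MN-1} \begin{pmatrix} -I \\ \gamma^T \end{pmatrix} \right\} \right\} \qquad (7)$$

where the following notation is used 

$$\tilde{V}_{r-1} = \begin{pmatrix} \tilde{V}_{y(r-1)} & \tilde{V}_{xy(r-1)}^T \\ \tilde{V}_{xy(r-1)} & \tilde{V}_{x(r-1)} \end{pmatrix} \qquad (8)$$

$$\tilde{V}_{x(r-1)} = \sum_{k=1}^{r-1} X_k X_k^T \qquad (9)$$

$$\tilde{V}_{xy(r-1)} = \sum_{k=1}^{r-1} X_k Y_k^T \qquad (10)$$

$$\tilde{V}_{y(r-1)} = \sum_{k=1}^{r-1} Y_k Y_k^T \qquad (11)$$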



The 2D CAR model can be expressed as a stationary causal uncorrelated noise 
driven 2D autoregressive process: 

$$Y_r = \gamma X_r + e_r, \qquad (12)$$

where $\gamma$ is the parameter vector 

$$\gamma = [a_1, \ldots, a_\eta], \qquad (13)$$

$$\eta = \mathrm{card}(I_r^c), \qquad (14)$$

$I_r^c$ is a causal neighbourhood, $e_r$ is white Gaussian noise with zero mean and 
a constant but unknown variance $\sigma^2$, and $X_r$ is the corresponding vector of 
$Y_{r-s}$ (design vector). 

3 Parameter Estimation 

The selection of an appropriate CAR model support is important to obtain good 
results in modelling a given random field. If the contextual neighbourhood 
is too small, it cannot capture all details of the random field. Inclusion of 
unnecessary neighbours, on the other hand, adds to the computational burden and 
can potentially degrade the performance of the model as an additional source of 
noise. 

The optimal Bayesian decision rule for minimizing the average probability of 
decision error chooses the maximum posterior probability model, i.e., a model 
$M_i$ corresponding to 

$$\max_j \{ p(M_j \mid Y^{(r-1)}) \}$$

where $Y^{(r-1)}$ denotes the known process history 

$$Y^{(r-1)} = \{Y_{r-1}, Y_{r-2}, \ldots, Y_1, X_r, X_{r-1}, \ldots, X_1\}. \qquad (15)$$

If we assume a uniform prior for all tested support sets (models), the solution can 
be found analytically. The most probable model given past data is the model 
$M_i$ for which 

$$i = \arg\max_j \{ D_{j(r-1)} \},$$

$$D_{j(r-1)} = \ln \Gamma\left(\frac{\beta(r) - \eta + 2}{2}\right) - \frac{1}{2} \ln |V_{x(r-1)}| - \frac{\beta(r) - \eta + 2}{2} \ln |\lambda_{(r-1)}| \qquad (16)$$

where 

$$\beta(r) = \beta(0) + r - 1, \qquad (17)$$

$$\beta(0) > 1, \qquad (18)$$

and 

$$\lambda_{(r)} = V_{y(r)} - V_{xy(r)}^T V_{x(r)}^{-1} V_{xy(r)}. \qquad (19)$$

Fig. 1. Wood texture of the ninth order (or) and its single (1) and multiple 
(2,3,4) scales resynthesis using the CAR (a) and MRF (m) models. 

Parameter estimation of a CAR model using the maximum likelihood, the least 
squares or Bayesian methods can be performed analytically. The Bayesian parameter 
estimates of the causal AR model with the normal-gamma parameter prior 
which maximize the posterior density are: 

$$\hat{\gamma}_{r-1}^T = V_{x(r-1)}^{-1} V_{xy(r-1)} \qquad (20)$$

and 

$$\hat{\sigma}_{r-1}^2 = \frac{\lambda_{(r-1)}}{\beta(r)}, \qquad (21)$$

where $V_{z(r-1)} = \tilde{V}_{z(r-1)} + V_{z(0)}$ and the matrices $V_{z(0)}$ come from the parameter prior. 
The estimates can also be evaluated recursively if necessary. 
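A minimal sketch of the analytic estimates for a single monospectral (scalar) field is given below. It accumulates the data statistics, adds a prior term, and evaluates the parameter vector, the residual statistic lambda and the noise variance. The two-pixel causal support, the prior values (`prior`, `beta0`) and all function names are hypothetical choices for illustration, not values taken from the paper.

```python
import random

SUPPORT = [(0, -1), (-1, 0)]  # hypothetical causal neighbourhood: left, upper


def solve(a, b):
    """Solve a x = b for a small square system (Gaussian elimination)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def estimate_car(field, support=SUPPORT, beta0=2.0, prior=1e-3):
    """Return (gamma_hat, sigma2_hat) for a scalar field.

    The loop bounds assume every support offset reaches at most one
    pixel up or to the left.
    """
    eta = len(support)
    vx = [[prior * (i == j) for j in range(eta)] for i in range(eta)]  # V_x(0)
    vxy = [0.0] * eta                                                   # V_xy(0)
    vy = prior                                                          # V_y(0)
    count = 0
    for r1 in range(1, len(field)):
        for r2 in range(1, len(field[0])):
            x = [field[r1 + d1][r2 + d2] for d1, d2 in support]  # design vector
            y = field[r1][r2]
            for i in range(eta):
                vxy[i] += x[i] * y
                for j in range(eta):
                    vx[i][j] += x[i] * x[j]
            vy += y * y
            count += 1
    gamma = solve(vx, vxy)                             # V_x^{-1} V_xy
    lam = vy - sum(g * v for g, v in zip(gamma, vxy))  # V_y - V_xy^T V_x^{-1} V_xy
    beta = beta0 + count                               # beta(r) with r - 1 = count
    return gamma, lam / beta


# Demo: recover the parameters of a synthetic field with known coefficients.
random.seed(1)
size = 80
field = [[0.0] * size for _ in range(size)]
for i in range(1, size):
    for j in range(1, size):
        field[i][j] = (0.5 * field[i][j - 1] + 0.3 * field[i - 1][j]
                       + random.gauss(0.0, 1.0))
gamma_hat, sigma2_hat = estimate_car(field)
```

On this synthetic field the estimates come out close to the generating values (0.5, 0.3) and unit noise variance; with a weak prior the result is essentially the least-squares solution.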

4 Model Synthesis 

The CAR model synthesis is very simple and a causal CAR random field can be 
directly generated from the model equation (12). 
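Direct generation from the autoregressive model equation amounts to one raster-order pass over the lattice. The sketch below assumes zero values outside the lattice and a user-supplied causal support; the function name and the example parameters are hypothetical.

```python
import random


def synthesize_car(rows, cols, gamma, support, sigma, seed=None):
    """Generate Y_r = gamma X_r + e_r in raster-scan order.

    Pixels outside the lattice are treated as zero (an assumption).
    """
    for d1, d2 in support:
        # Causality: each neighbour must already have been generated.
        if not (d1 < 0 or (d1 == 0 and d2 < 0)):
            raise ValueError("support must be causal in raster-scan order")
    rng = random.Random(seed)
    field = [[0.0] * cols for _ in range(rows)]
    for r1 in range(rows):
        for r2 in range(cols):
            value = rng.gauss(0.0, sigma)  # white Gaussian noise e_r
            for a, (d1, d2) in zip(gamma, support):
                s1, s2 = r1 + d1, r2 + d2
                if 0 <= s1 < rows and 0 <= s2 < cols:
                    value += a * field[s1][s2]
            field[r1][r2] = value
    return field


# Demo with hypothetical parameters: left and upper neighbours.
texture = synthesize_car(32, 32, [0.5, 0.3], [(0, -1), (-1, 0)], sigma=1.0, seed=0)
```

No iterative simulation is needed, which is the source of the speed advantage over Markov chain Monte Carlo based synthesis.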

Single CAR models synthesize spatial frequency bands of the texture. Each 
monospectral fine-resolution component is obtained from the pyramid collapse 
procedure (the inversion process to (5), (6)). Finally, the resulting synthesized colour 
texture is obtained from the set of synthesized monospectral images using the 
inverse K-L transformation: 

$$\tilde{Y}_r = T^{-1} \bar{Y}_r \qquad (22)$$

Fig. 2. Natural cloud texture (or) and its single (1) and multiple (2,3,4) 
scales resynthesis using the CAR (a) and MRF (m) models. 
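The synthesis path of Section 4 (pyramid collapse followed by the inverse K-L transformation) can be sketched as follows. The sketch builds a Laplacian pyramid by differencing Gaussian layers and then inverts it. The factor n = 2, pixel replication for up-sampling, edge replication in the filter and all function names are assumptions of this illustration; in the actual method each band-pass layer would be replaced by a CAR-synthesized one before collapsing.

```python
KERNEL = (0.25, 0.5, 0.25)  # weights w_1, w_0, w_1


def _clamp(i, n):
    return min(max(i, 0), n - 1)


def smooth(img):
    """Separable low-pass filter; edge pixels are replicated."""
    rows, cols = len(img), len(img[0])
    h = [[sum(w * img[r][_clamp(c + k - 1, cols)] for k, w in enumerate(KERNEL))
          for c in range(cols)] for r in range(rows)]
    return [[sum(w * h[_clamp(r + k - 1, rows)][c] for k, w in enumerate(KERNEL))
             for c in range(cols)] for r in range(rows)]


def reduce_level(img, n=2):
    """Low-pass filter, then keep every n-th row and column."""
    return [row[::n] for row in smooth(img)[::n]]


def expand_level(img, rows, cols, n=2):
    """Up-sample to rows x cols by pixel replication followed by smoothing."""
    up = [[img[_clamp(r // n, len(img))][_clamp(c // n, len(img[0]))]
           for c in range(cols)] for r in range(rows)]
    return smooth(up)


def laplacian_pyramid(img, levels, n=2):
    """Return (band-pass layers, coarsest low-pass image)."""
    gauss = [img]
    for _ in range(levels):
        gauss.append(reduce_level(gauss[-1], n))
    layers = []
    for k in range(levels):
        up = expand_level(gauss[k + 1], len(gauss[k]), len(gauss[k][0]), n)
        layers.append([[a - b for a, b in zip(ra, rb)]
                       for ra, rb in zip(gauss[k], up)])
    return layers, gauss[-1]


def collapse(layers, coarsest, n=2):
    """Invert laplacian_pyramid: add each layer to the expanded coarser image."""
    img = coarsest
    for layer in reversed(layers):
        up = expand_level(img, len(layer), len(layer[0]), n)
        img = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(layer, up)]
    return img


def inverse_kl(T, vector):
    """Apply T^{-1}; the rows of T are orthonormal, so T^{-1} = T^T."""
    return [sum(T[j][c] * vector[j] for j in range(len(T)))
            for c in range(len(T[0]))]


# Demo: analysis followed by collapse reproduces the input image.
image = [[float((7 * r + 3 * c) % 11) for c in range(8)] for r in range(8)]
layers, coarsest = laplacian_pyramid(image, levels=2)
restored = collapse(layers, coarsest)
```

Since the same expansion operator is used in analysis and collapse, reconstruction is exact up to rounding regardless of the particular up-sampling filter. After the inverse K-L step the mean removed during centring has to be added back to obtain displayable colour values.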



5 Results 

The two presented natural colour examples (Fig. 1, Fig. 2) violate the model 
stationarity assumption. Nevertheless, they demonstrate the advantage of 
the multiscale approach in texture modelling over single-scale models in such 
unfavourable conditions. Figs. 1 and 2 also compare CAR synthesis results with 
results from the multiscale Markov model [?] synthesis. Fig. 1 shows a wood texture 
synthesized using a ninth order (24 different parameters) MRF model. We tried 
to resynthesize this texture with inadequate low order CAR and MRF models. 
The CAR model had only five contextual neighbours and the MRF model was 
of the second order (4 different parameters). Figs. 1(a1,m1), 2(a1,m1) and 3 show 
unsatisfactory results using the single-scale texture models, while these figures 
simultaneously demonstrate an improvement if we use our presented multi-scale 
model with two, three or four scale levels, respectively. The second example, 
Fig. 2(or), is a natural cloud texture. The texture is non-stationary and thus 
violates the CAR model assumption. Fig. 2(a2-a4) shows synthesis results for the 
CAR model while Fig. 2(m2-m4) shows the second order MRF results for two, 
three or four-scale models, respectively. Both examples (Fig. 1, Fig. 2) are colour 
textures and they were converted to the grey scale representation only to be 
printable in the proceedings. The multi-scale models demonstrate their clear 
superiority over their single-scale counterparts. The colour quality is comparable 
between single-scale and multi-scale models and it is very good in general. 

A Multiresolution Causal Colour Texture Model 




Fig. 3. Natural green marble texture and its single, four-scale resynthesis using the 
CAR model. 





6 Conclusions 

Our testing results of the algorithm are encouraging. Some synthetic textures 
reproduce the given digitized texture images so well that the natural and synthetic 
textures are visually indiscernible. The multi-scale approach is more robust and 
allows better results than the single-scale one if the synthesis model is inadequate 
(lower order model, non-stationary texture, etc.). The MRF multiscale model 
seems to be superior to the causal CAR model for some textures; however, the 
CAR model synthesis is much faster than the MRF model synthesis. The CAR 
model is therefore better suited for real-time or web-distributed texture modelling 
applications. The proposed method allows a large compression ratio for transmitting or 
storing texture information while having very moderate computational complexity. 

Abstract. The aim of this paper is to determine the statistical validity of 
individuality in handwriting based on measurement of features, quantification 
and statistical analysis. In classification problems such as writer, face, 
fingerprint or speaker identification, the number of classes is very large 
or unspecified. To establish the inherent distinctness of the classes, i.e., 
validate individuality, we transform the many-class problem into a dichotomy 
by using a “distance” between two samples of the same class and 
those of two different classes. A measure of confidence is associated with 
individuality. Using ten feature distance values, we trained an artificial 
neural network and obtained 97% overall correctness. In this experiment, 
1,000 people provided three sample handwritings. 

Key Words: Dichotomizer, Hypothesis Testing, Individuality, Writer 
Identification 



1 Introduction 

The writer identification problem is the process of comparing questioned handwriting 
with samples of handwriting obtained from known sources for the purpose 
of determining authorship or non-authorship. In other words, it is the examination 
of the design, shape and structure of handwriting to determine the authorship 
of given handwriting samples. Document examiners or handwriting analysis 
practitioners find important features to characterize individual handwriting, as 
features are consistent with writers in normal undisguised handwriting. Authorship 
may be determined on the hypothesis that people’s handwritings 
are as distinctly different from one another as their individual natures, 
as their own fingerprints. It is believed that no two people write the exact same 
thing the exact same way. 

Since writer identification plays an important investigative and forensic 
role in many types of crime, various automatic writer identification 
techniques, feature extraction, comparison and performance evaluation methods 
have been studied (see [?] for an extensive survey). Osborn suggested a statistical 
basis for handwriting examination by the application of the Newcomb 
rule of probability, and Bertillon was the first to apply Bayes' theorem to 
handwriting examination [?]. Hilton calculated the odds by taking the likelihood 
ratio statistic, that is, the ratio of the probability calculated on the basis of the 
similarities, under the assumption of identity, to the probability calculated on 
the basis of dissimilarities, under the assumption of non-identity [?]. However, 
relatively little study has been carried out to demonstrate its scientific and 
statistical validity and reliability as forensic evidence. To identify writers, it is 
necessary to determine the statistical validity of individuality in handwriting 
based on measurement of features, quantification, and statistical analysis. 

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 123-C23 2000. 
© Springer-Verlag Berlin Heidelberg 2000 

S.-H. Cha and S.N. Srihari 

Consider the multiple class problem where the number of classes is small and 
one can observe enough instances of each class. To show the individuality of class 
statistically, one can cluster samples into classes and infer it to the population. 
It is an easy and valid setup to establish the individuality. Now consider the 
many class problem where the number of classes is too large to be observed (n is 
very large). Most pattern identification problems such as writer, face, fingerprint 
or speaker identification fall under the aegis of the many class problem. Most 
parametric or non-parametric multiple classification techniques are of no use and 
the problem is seemingly insurmountable because the number of classes is too 
large or unspecified. 

To establish the inherent distinctness of the classes, i.e., validate individuality, 
we transform the many-class problem into a dichotomy by using a “distance” 
between two samples of the same class and those of two different classes. We 
tackle the problem by defining a distance metric between two writings and finding 
all writings which are within the threshold for every feature. In this model, 
one need not observe all classes, yet it allows the classification of patterns. It is 
a method for measuring the reliability of classification over the entire set of classes 
based on information obtained from a small sample of classes drawn from the 
class population. In this model, two patterns are categorized into one of only 
two classes: they are either from the same class or from two different classes. 
Given two handwriting samples, the distance between the two documents is first 
computed. This distance value is used as data to be classified as positive (authorship, 
inner-variation, within author or identity) or negative (non-authorship, 
inter-variation, between different authors or non-identity). We use within author 
distance and between authors distance throughout the rest of this paper. Also, 
we use the positive (⊕) and negative (⊖) symbols as subscripts in the nomenclature 
for all variables of within author distance and between authors distance, 
respectively. 

The subsequent sections are organized as follows. Section 2 discusses 
the dichotomy transformation. Section 3 describes the experimental database 
of writers, exemplars and features. Section 4 presents the full statistical analysis of 
the collected database and gives the experimental results. Finally, Section 5 
concludes the paper. 

2 Transformation of Polychotomizer to Dichotomizer 

Writer identification can be viewed as a classification problem with one category 
per member of the U.S. population, a so-called polychotomizer. As the number of 
classes is enormously large and almost infinite, this problem is seemingly 
insurmountable. In this section, we show how to transform a large polychotomizer 
to a simple dichotomizer, a classifier that places a pattern in one of only two categories. 

Fig. 1. Transformation from (a) Feature domain to (b) Feature distance domain 

To illustrate, suppose there are three writers, $\{W_1, W_2, W_3\}$. Each writer 
provides three documents, and two scalar-valued features are extracted per document. 
Fig. 1(a) shows the plot of documents for every writer. To transform into distance 
space, we take the vector of distances of every feature between writings 
by the same writer and categorize it as a within author distance, denoted by $x_\oplus$. 
The sample of between author distance is, on the other hand, obtained by measuring 
the distance between two different persons’ handwritings and is denoted 
by $x_\ominus$. Let $d_{ij}$ denote the $i$'th writer’s $j$'th document. 

$$x_\oplus = \delta(d_{ij} - d_{ik}) \quad \text{where } i = 1 \text{ to } n,\ j, k = 1 \text{ to } m \text{ and } j \neq k \qquad (1)$$

$$x_\ominus = \delta(d_{ij} - d_{kl}) \quad \text{where } i, k = 1 \text{ to } n,\ i \neq k \text{ and } j, l = 1 \text{ to } m \qquad (2)$$

where $n$ is the number of writers, $m$ is the number of documents per person, and $\delta$ 
is the distance between two documents. Fig. 1(b) represents the transformed 
plot. The feature space domain is transformed to the feature distance space 
domain. There are only two categories: within author distance and between author 
distance. 
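The transformation from the feature domain to the feature distance domain can be sketched as below. The per-feature absolute difference used for the distance and the numerical feature values are hypothetical; the source leaves the concrete distance measure open.

```python
from itertools import combinations


def feature_distance(doc_a, doc_b):
    """Vector of per-feature absolute differences (one possible choice of delta)."""
    return [abs(a - b) for a, b in zip(doc_a, doc_b)]


def dichotomy_transform(writers):
    """writers[i][j] is the feature vector of writer i's j-th document.

    Returns (within author distances, between author distances).
    """
    within, between = [], []
    for docs in writers:
        for a, b in combinations(docs, 2):          # same writer, j != k
            within.append(feature_distance(a, b))
    for docs_i, docs_k in combinations(writers, 2):  # different writers, i != k
        for a in docs_i:
            for b in docs_k:
                between.append(feature_distance(a, b))
    return within, between


# Three hypothetical writers, three documents each, two scalar features.
writers = [
    [[0.10, 0.80], [0.12, 0.79], [0.09, 0.83]],
    [[0.55, 0.20], [0.57, 0.22], [0.52, 0.18]],
    [[0.90, 0.60], [0.88, 0.63], [0.91, 0.58]],
]
within, between = dichotomy_transform(writers)
```

For three writers with three documents each, the sketch yields 9 within author and 27 between author distance vectors, each pair of documents being counted once.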

It would be desirable if all distances within the same class (writer) in the feature 
domain belonged to the within class distance class in the feature distance domain. 
Similarly, we would like all distances between two different classes in the feature 
domain to belong to the between class distance class in the feature distance domain. 
Unfortunately, this is not always the case; a perfectly clustered class in the feature 
domain may not be perfectly dichotomized in the feature distance domain. Thus, we 
have a trade-off between tractability and accuracy. Since sampling a sufficiently 
large sample from each individual person is intractable, we may wish to transform 
the feature domain to the feature distance domain, where we can get large samples for 
both classes. By the transformation, the problem becomes a tractable inferential 
statistics problem, but we might get lower accuracy. 



3 Experimental Database 

There are three steps to validate the individuality of handwriting: i) data collection, 
ii) feature extraction and iii) statistical analysis. In this section, we discuss 
the first two issues; Section 4 covers the statistical analysis. The 
first step is data collection of writers, exemplars, and features. We collected seven 
attributes of writers through a questionnaire data sheet: gender, age, 
handedness, highest level of education, country or state of primary education, 
ethnicity and country of birth. We built a database that is “representative” of 
the US population. This has been achieved by basing our sample distribution on 
the US census data (1996 projections) [?]. The sample comprises 510 female and 
490 male writers, 36% of the white ethnicity group, etc. The database 
contains handwriting samples of 1000 distinct writers. 



3.1 Exemplar: CEDAR Letter 

From                                         Nov 10, 1999 

Jim Elder 
829 Loop Street, Apt 300 
Allentown, New York 14707 

To 

Dr. Bob Grant 
602 Queensberry Parkway 
Omar, West Virginia 25638 

We were referred to you by Xena Cohen at the University Medical Center. 
This is regarding my friend, Kate Zack. 

It all started around six months ago while attending the "Rubeq" Jazz 
Concert. Organizing such an event is no picnic, and as President of the 
Alumni Association, a co-sponsor of the event, Kate was overworked. But 
she enjoyed her job, and did what was required of her with great zeal 
and enthusiasm. 

However, the extra hours affected her health; halfway through the show 
she passed out. We rushed her to the hospital, and several questions, 
x-rays and blood tests later, were told it was just exhaustion. 

Kate’s been in very bad health since. Could you kindly take a look at 
the results and give us your opinion? 

Thank you! 

Jim 

Fig. 2. CEDAR Letter 

Each subject provides three exemplars of the CEDAR Letter shown in Figure 2. 
The CEDAR Letter is concise (it has just 156 words), easy to understand and 
complete. It is complete in that each letter of the alphabet occurs at the beginning 
of a word as a capital and as a small letter, and as a small letter in the middle and 
at the end of a word. In addition, it contains punctuation, numerals, interesting letter 
and numeral combinations (ff, tt, oo, 00) and a general document structure that 
allows us to extract document level features such as word and line spacing, 
line skew, etc. Forensic literature refers to many such documents, the “London 
Letter” and the “Dear Sam Letter” to name a few, but none of them is complete in 
the sense of the CEDAR Letter, as follows. All capitals must appear in the letter 
and it is desirable to have all small letters in the beginning, middle and terminal 
positions of a word. We score the letter according to these constraints: 

$$\mathrm{score}(\text{letter } x) = \frac{104 - \text{number of 0's}}{104} \qquad (3)$$

The CEDAR letter scores 99% whereas the London letter scores 76%. The CEDAR 
letter has only one zero entry, namely a word that ends with the letter “j”. Since 
there is no common English word that ends with the letter “j”, the CEDAR letter 
excludes this entry. 
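The completeness score can be illustrated with a short sketch. It assumes that the 104 entries are the 26 letters in four positional categories (initial capital, initial small, middle small, terminal small), which is our reading of the constraints stated above; the tokenisation by a simple regular expression and the function names are likewise hypothetical.

```python
import re

TOTAL_ENTRIES = 26 * 4  # 26 letters x 4 positional categories = 104


def score_from_zeros(zeros, total=TOTAL_ENTRIES):
    """Score of a letter given its number of zero (missing) entries."""
    return (total - zeros) / total


def completeness_score(text):
    """Fraction of the 104 (letter, position) entries occurring at least once."""
    seen = set()
    for word in re.findall(r"[A-Za-z]+", text):
        first = word[0]
        position = "initial capital" if first.isupper() else "initial small"
        seen.add((first.lower(), position))
        for ch in word[1:-1]:
            if ch.islower():
                seen.add((ch, "middle"))
        if len(word) > 1 and word[-1].islower():
            seen.add((word[-1], "terminal"))
    return score_from_zeros(TOTAL_ENTRIES - len(seen))


# One missing entry out of 104, as stated for the CEDAR letter.
cedar_score = score_from_zeros(1)
# A two-letter text fills exactly two of the 104 entries.
tiny_score = completeness_score("Ab")
```

With a single zero entry the score is 103/104, which rounds to the 99% quoted for the CEDAR letter.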

3.2 Feature Extraction 

Encouraged by the recent success in off-line handwriting recognition and 
handwritten address interpretation [3], we utilize similar features for the 
individuality validation. Although there are numerous line, word, character 
and spacing features, we give some document level computational features. 

The darkness value is the threshold value that separates the character parts 
and background parts of the document image. A digital image is a rectangular 
array of picture elements called pixels and each pixel has a darkness value 
between 0 and 255. A histogram is built and it has two peaks. One is due to 
the dark handwritten characters and the other is due to the bright background. The 
valley between the two peaks is the grey level threshold. We use the darkness value, 
i.e. the grey level threshold value, as an indicator of pen pressure. Another document 
level feature is the number of blobs, that is, the number of connected components 
in the document image. A blob is also known as an exterior contour. This feature 
is related to intra-word and inter-word connections. Writers who connect 
characters or words have fewer blobs, while those who do not connect them 
have many blobs. A similar feature is the number of holes, that is, the number 
of closed loops. A hole is often called an interior contour or a lake. This feature 
captures the tendency to make loops while writing. The average stroke width 
feature is computed by measuring the most frequent stroke width per line. We 
also compute the slant, skew and average character height features. 
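Two of the document level features, the grey level threshold and the number of blobs, can be sketched as follows. The sketch assumes an 8-bit image in which low values are dark ink, a clearly bimodal histogram, and 8-connectivity for blobs; the `min_separation` parameter and the function names are hypothetical.

```python
from collections import deque


def grey_threshold(image, min_separation=32):
    """Valley between the two main histogram peaks (assumes bimodality)."""
    hist = [0] * 256
    for row in image:
        for v in row:
            hist[v] += 1
    peak1 = max(range(256), key=hist.__getitem__)
    far = [g for g in range(256) if abs(g - peak1) >= min_separation]
    peak2 = max(far, key=hist.__getitem__)
    lo, hi = sorted((peak1, peak2))
    return min(range(lo, hi + 1), key=hist.__getitem__)


def count_blobs(image, threshold):
    """Number of 8-connected components of pixels darker than the threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] >= threshold or seen[r][c]:
                continue
            blobs += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx]
                                and image[ny][nx] < threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return blobs


# Tiny hypothetical page: bright background (220) with two ink strokes (30).
page = [
    [220, 220, 220, 220, 220, 220, 220],
    [220, 30, 30, 220, 220, 30, 220],
    [220, 30, 220, 220, 220, 30, 220],
    [220, 220, 220, 220, 220, 220, 220],
]
threshold = grey_threshold(page)
blob_count = count_blobs(page, threshold)
```

On real scans the histogram would normally be smoothed before the valley search; that step is omitted here for brevity.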

4 Analysis 

4.1 Size of Sample 

Let $n_\oplus = |x_\oplus|$ and $n_\ominus = |x_\ominus|$. 

Fact 1. If $n$ people provide $m$ writings each, there are $n_\oplus = \binom{m}{2} \times n$ positive data, 
$n_\ominus = m \times m \times \binom{n}{2}$ negative data and $\binom{mn}{2}$ data in total. 

Proof. $n_\oplus = \binom{m}{2} \times n$ is straightforward. To count the negative data, we can 
enumerate them as $m \times (m \times (n-1)) + m \times (m \times (n-2)) + \cdots + m \times (m \times 1)$. For the 
first author, there are $m \times (n-1)$ writings by other writers and he has 
$m$ writings (three in our case). For the second author, there are $m \times (n-2)$ writings 
by other writers that are not counted yet. Therefore, 
$n_\ominus = m \times m \times \frac{n(n-1)}{2} = m \times m \times \binom{n}{2}$. 
Now, $n_\oplus + n_\ominus$ must be $\binom{mn}{2}$: 

$$\binom{mn}{2} = \frac{(mn)!}{(mn-2)!\,2!} = \frac{(mn)(mn-1)}{2} = \frac{m(m-1)}{2}\, n + m^2\, \frac{n(n-1)}{2} = n_\oplus + n_\ominus \qquad \square$$

In our data collection, 1000 people (statistically representative of the U.S. population) 
provide exactly three samples each. Hence, there are $n_\oplus = 3000$, $n_\ominus = 4{,}495{,}500$ and 
$4{,}498{,}500$ data in total. 
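The counts of Fact 1 are easy to check numerically. The following sketch uses only the binomial coefficients stated in the fact; the function name is hypothetical.

```python
from math import comb


def sample_sizes(n, m):
    """Return (n_plus, n_minus, total) for n writers with m documents each."""
    n_plus = comb(m, 2) * n        # pairs of documents by the same writer
    n_minus = m * m * comb(n, 2)   # pairs of documents by different writers
    return n_plus, n_minus, comb(m * n, 2)


n_plus, n_minus, total = sample_sizes(1000, 3)
```

For n = 1000 and m = 3 this reproduces the figures quoted in the text, and the two class sizes add up to the total number of document pairs.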

Most statistical testing requires the assumption that the observed data be 
statistically independent. The distance data are not statistically independent, one 
obvious reason being the triangle inequality among the three distances of the same 
person. This caveat should not be ignored. One immediate solution is to randomly 
choose a smaller sample from the large sample, obviating the triangle inequality. 
One can partition the $n_\oplus = 3000$ data into disjoint subsets of 500, guaranteeing no 
triangle inequality. 



4.2 Feature Evaluation 

A good descriptive way to represent the relationship between two populations is 
to calculate the overlap between the two distributions. Fig. 3 illustrates the two 
distributions assuming that they are normal. Although this assumption is invalid, we 
use it to describe the behavior of the two populations figuratively. The type I error, 
$\alpha$, occurs when the same author’s documents are identified as written by different 
authors, and the type II error, $\beta$, occurs when two documents written by two different 
writers are identified as written by the same writer, as shown in Figure 3. 

$$\alpha = \Pr(\mathrm{dichotomizer}(d_{ij}, d_{kl}) > T \mid i = k) \qquad (4)$$

$$\beta = \Pr(\mathrm{dichotomizer}(d_{ij}, d_{kl}) < T \mid i \neq k) \qquad (5)$$
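The two error probabilities can be estimated empirically from samples of within author and between author distances. In the minimal sketch below the scalar distance values and the threshold are hypothetical.

```python
def error_rates(within, between, threshold):
    """Empirical type I and type II error for scalar distances.

    Type I: a within author distance exceeds the threshold.
    Type II: a between author distance falls below the threshold.
    """
    alpha = sum(d > threshold for d in within) / len(within)
    beta = sum(d < threshold for d in between) / len(between)
    return alpha, beta


# Hypothetical scalar distances for illustration only.
within_d = [0.1, 0.2, 0.3, 0.9]
between_d = [0.4, 0.8, 1.1, 1.5, 2.0]
alpha, beta = error_rates(within_d, between_d, threshold=0.5)
```

Sweeping the threshold over a grid and recording both rates gives the usual trade-off curve between the two error types.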



Let $X$ denote the distance position where the two distributions intersect. As 
shown in Fig. 3, the type I error is the right-side area of the positive distribution, where 
the decision bound is $T = X$. Suppose one must make a crisp decision and chooses 
the intersection as the classification bound. Then the type I error is 
the probability that one classifies two writings as written by different authors even 
though they were written by the same person. The type II error is the left-side area 
of the negative distribution, i.e. the probability that one classifies two 
writings as written by the same author even though they were written by two different 
writers. Table 1 shows the intersection positions $X$ and the proportion of each error for 
each feature. Note that feature (e) is an excellent feature whereas feature (d), 
the average stroke width, is a bad one. Note that the last column in Table 1 is 
not a multivariate result but the univariate overlap of the Euclidean distance 
of multiple features. 

Fig. 3. Type I and II errors 

Fig. 4. Three Feature Space Distributions 

Table 1. Evaluating Features by overlaps 

               δ_a      δ_b      δ_c      δ_d      δ_e      δ_f      δ_abce 
X              0.0172   0.1029   0.0825   0.0317   0.0576   1.7300   0.1407 
Type 1 error   9.0%     6.94%    5.0%     24.54%   0.81%    3.0%     3.84% 
Type 2 error   38.6%    27.3%    26.0%    51.4%    15.7%    27.0%    14.0% 
Rem.           Good     Good     Good     Bad      Best     Good 

Another novel way to handle multiple features is to compute the distance value for 
each feature and form a multi-dimensional vector of distances. Fig. 4 illustrates 
three-dimensional distance vectors for three of the features (including $\delta_c$ and $\delta_e$). 
Similar to the one-dimensional case, the within author distances tend to cluster 
toward the origin while the between authors distances tend to lie away from the origin. 
Various multivariate analyses, such as Hotelling's $T^2$ statistic for testing hypotheses 
on two multivariate means, could be applied, but we use an artificial neural network. 

4.3 Dichotomizer: Artificial Neural Network 

Samples of both classes are divided into 6 groups of 500. One pair of groups is 
used as the training set and another as the validation set. The rest 
are used as testing sets. Using ten feature distance values, we trained 
an artificial neural network as shown in Fig. 5. It is observed that the higher 
the number of features, the better the dichotomizer, as shown in Table 2. 

Fig. 5. ANN Dichotomizer Design 

Table 2. Experimental results vs. the number of features 

no. of features   5      9      10 
Type I error      5.3%   4.6%   3.5% 
Type II error     5.2%   3.5%   2.1% 
Accuracy          95%    96%    97% 
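The accuracies in Table 2 are consistent with averaging the two error types, which is what one obtains when the positive and negative test sets are of equal size. That reading is an assumption on our part (the text states equal groups of 500 for both classes); the short check below uses only the numbers from the table.

```python
def balanced_accuracy(type1, type2):
    """Overall correctness when both classes contribute equally many test samples."""
    return 1.0 - (type1 + type2) / 2.0


# (type I error, type II error) per number of features, taken from Table 2.
errors = {5: (0.053, 0.052), 9: (0.046, 0.035), 10: (0.035, 0.021)}
accuracy = {k: balanced_accuracy(*v) for k, v in errors.items()}
```

Rounded to whole percent, the three values agree with the accuracy row of the table.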



5 Conclusion 

In this paper, we showed that the multiple category classification problem can 
be viewed as a two-category problem by defining a distance and taking those 
values as positive and negative data. This paradigm shift from the polychotomizer 
to the dichotomizer makes writer identification, which is a hard multiple-class 
problem over the U.S. population, very simple. 

We designed an experiment to show the individuality of handwriting by 
collecting samples from people who are representative of the US population. 
Given two randomly selected handwritten documents, we can determine whether 
the two documents were written by the same person or not. Our performance is 
97%. 

One advantage of the dichotomy model working on the distribution of distances
is that many standard geometrical and statistical techniques can be used, because
the distance data are nothing but scalar values in the feature distance domain,
whereas the feature data type varies in the feature domain. Thus, it helps to
overcome the non-homogeneity of features. Techniques in pattern recognition
typically require that features be homogeneous. While it is hard to design a
polychotomizer due to the non-homogeneity of features, the dichotomizer
simplifies the design by mapping the features to homogeneous scalar values in
the distance domain. Types of features can be nominal, linear, angular, strings,
histograms [2], etc. A full discussion of
the multiple feature integration for writer identification and various distance 
measures can be found in 0. 
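The mapping from the feature domain to the distance domain can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, every feature is taken to be a plain number, and the per-feature distance is the absolute difference, whereas the text notes that real features may be nominal, angular, strings or histograms and would need their own distance measures.

```python
from itertools import combinations

def distance_vector(x, y):
    """Per-feature distance between two documents (absolute difference,
    an assumption that only suits numeric features)."""
    return [abs(a - b) for a, b in zip(x, y)]

def to_dichotomy(samples):
    """samples: list of (writer_id, feature_vector) pairs.
    Returns (distance_vector, label) pairs, where label 1 means
    'same writer' (within-author) and 0 means 'different writers'."""
    data = []
    for (w1, f1), (w2, f2) in combinations(samples, 2):
        data.append((distance_vector(f1, f2), 1 if w1 == w2 else 0))
    return data

# Hypothetical toy data: two documents by writer "A", one by writer "B".
samples = [("A", [0.10, 5.0]), ("A", [0.12, 5.5]), ("B", [0.40, 9.0])]
pairs = to_dichotomy(samples)
```

Whatever the number of writers, the resulting data set has only two classes, which is what allows a single dichotomizer to be trained on it.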

5.1 Work in Progress

Features used in the analysis are document-level features. As the segmentation
tools are developed, line-, word-, and character-level features will
be applied. Higher performance is expected.

We are currently dealing with the following five issues: i) comparison between
polychotomy and dichotomy: comparing polychotomy in the feature domain and
dichotomy in the distance domain from the viewpoint of tractability vs.
accuracy, ii) distance measures: using and evaluating several distance measures,
e.g., element, histogram, probability density function, string, and convex hull
distances, iii) efficient search: nearest-neighbor algorithms for distance
measures, iv) applications: designing and analyzing an algorithm for writer
identification for a known number of writers and a method for handwritten
document image indexing and retrieval, and v) discovery: mining a database
consisting of writer data and features obtained from a handwriting sample,
statistically representative of the US population, for feature evaluation and
to determine the similarity of a specific group of people.

Abstract. Although multidimensional primitives are more powerful
than string primitives, and some work exists on distance measures
between multidimensional objects, there are not many applications
of this kind of language to syntactic pattern recognition tasks.
In this work, multidimensional primitives are used for object
modelling in a handwritten digit recognition task under a syntactic
approach. Two well-known tree language inference algorithms are
considered to build the models, using as error model an algorithm
that obtains the editing distance between a tree automaton and a tree;
the editing distance algorithm gives the measure needed to complete the
classification. The experiments carried out show the good performance
of the approach.

Keywords: Syntactic pattern recognition, editing distance, tree automata,
error correcting parsing.



1 Introduction 

A wide range of problems are related to pattern recognition. This general
problem has two main approaches: the geometric one, and the structural
or syntactic one.

When the choice is the syntactic approach, the first step to consider is the
modelling of the object domain. This phase is usually followed by an inference
process to build grammatical models able to capture the structure of the
different classes. After this, parsing a sample against the models classifies
it by class membership.

Although the modelling phase has usually been performed by using strings over
a given alphabet, mainly to take advantage of the large number of tools that
exist for string automata, there also exist more powerful primitives to model
the object domain, for instance tree and graph primitives. Since hierarchical
features or information concerning connections between related areas of the
pattern are inherent to these primitives, these features can be easily
modelled by them.

Much work has been done on graphs and graph grammars, and although
graph language inference algorithms even exist, the time complexity
involved makes these methods hard to use in applied tasks. Besides,
due to the high representation power of these primitives, even very restricted
families of graphs or graph languages might be useful in pattern recognition.

The descriptive power of tree primitives and the existence of time-efficient
tools to handle tree representations make this approach interesting. In fact,
there exist several tree language inference algorithms. Some of them use
complete presentation in the inference process, others characterize a specified
family, or adapt previous string language algorithms. Still others, even though
they were not developed for this purpose, can be seen as tree language inference
algorithms, since the set of structured samples of a string language forms a
tree language.

Other works deal with the tree editing problem, establish a distance
model between trees, or study the complexity of the distance computation [23].

When a real application is wanted, the availability of a set of samples big
enough to capture the variability of the classes is a serious problem. So when
an object is going to be classified, it usually does not fit any model, and it
is therefore necessary to obtain the model nearest to the object structure.
There are several methods in which string languages or even tree languages
are involved, but the latter have not been used frequently.

This work uses a tree based pattern recognition approach in a handwritten 
digit recognition task where the classification is carried out by using a recently 
proposed method which obtains a distance measure between a tree and a tree 
automaton. First of all, this work introduces the notation needed. Then the 
approach to be used is explained: the feature extraction procedure, the tree 
language algorithms used and the error correcting algorithm which will provide 
the error model, and the series of experiments carried out. Finally, a summary
of the best results, the conclusions and the proposed lines of future work are
presented.

2 Theoretical Concepts and Notation 

Let $V$ be an alphabet and $\mathbb{N}$ the set of natural numbers. A ranked
alphabet is defined as the association of $V$ with a finite relation
$r \subseteq (V \times \mathbb{N})$. $V_n$ denotes the subset
$\{\sigma \in V \mid (\sigma, n) \in r\}$.

Let $V^T$ be the set of finite trees whose nodes are labelled with symbols in $V$,
where a tree is defined inductively as follows:

\[ V_0 \subseteq V^T \]
\[ \sigma(t_1, \ldots, t_n) \in V^T : \forall t_1, \ldots, t_n \in V^T,\ \sigma \in V_n \]

Let the root of a tree $t$, denoted by $root(t)$, be:

\[ root(a) = a : \forall a \in V_0 \]
\[ root(\sigma(t_1, \ldots, t_n)) = \sigma : \forall t_1, \ldots, t_n \in V^T,\ \sigma \in V_n \]

Let the size of a tree $t$ ($|t|$) be:

\[ |a| = 1 : \forall a \in V_0 \]
\[ |\sigma(t_1, \ldots, t_n)| = 1 + \sum_{i=1}^{n} |t_i| : \forall t_1, \ldots, t_n \in V^T,\ \sigma \in V_n \]

A deterministic tree automaton is defined as the four-tuple $A = (Q, V, \delta, F)$
where $Q$ is a finite set of states; $V$ is a ranked alphabet; $F \subseteq Q$ is a
set of final states and $\delta = (\delta_0, \ldots, \delta_m)$ is a finite set of
functions defined as:

\[ \delta_n : (V_n \times (Q \cup V_0)^n) \rightarrow Q \qquad n = 1, \ldots, m \]
\[ \delta_0(a) = a \qquad \forall a \in V_0 \]

$\delta$ can be extended to operate on trees as follows:

\[ \delta : V^T \rightarrow Q \cup V_0 \]
\[ \delta(\sigma(t_1, \ldots, t_n)) = \delta_n(\sigma, \delta(t_1), \ldots, \delta(t_n)) \quad \text{if } n > 0 \]
\[ \delta(a) = a \qquad \forall a \in V_0 \]

A tree $t \in V^T$ is accepted by $A$ if $\delta(t) \in F$. The set of trees
accepted by $A$ is defined as $L(A) = \{t \in V^T \mid \delta(t) \in F\}$.
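The definitions above translate almost directly into a short program. The following is a minimal sketch, not the authors' implementation: trees are encoded as nested tuples and the transition functions as a single dictionary, both of which are representation choices made here for illustration.

```python
# A tree is a tuple (label, child_1, ..., child_n); a leaf is (label,).

def root(t):
    """Label of the root node."""
    return t[0]

def size(t):
    """Number of nodes: 1 for a leaf, 1 + sum of child sizes otherwise."""
    return 1 + sum(size(child) for child in t[1:])

def delta(t, transitions):
    """Bottom-up evaluation of the extended transition function.
    Leaves evaluate to their own label (delta_0(a) = a). Internal nodes
    look up (label, value_1, ..., value_n) in `transitions`; an undefined
    transition yields None."""
    if len(t) == 1:
        return t[0]
    key = (t[0],) + tuple(delta(child, transitions) for child in t[1:])
    return transitions.get(key)

def accepts(t, transitions, final_states):
    """A tree is accepted when it evaluates to a final state."""
    return delta(t, transitions) in final_states

# Hypothetical automaton: V_0 = {a, b}, V_2 = {s}, Q = F = {q1}.
transitions = {("s", "a", "b"): "q1", ("s", "q1", "b"): "q1"}
tree = ("s", ("s", ("a",), ("b",)), ("b",))
```

With this toy automaton, `accepts(tree, transitions, {"q1"})` holds, while a tree that uses an undefined transition such as `("s", ("b",), ("a",))` is rejected.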

3 Syntactic Pattern Recognition 

Although there exist different approaches in the literature to perform a
syntactic classification, the traditional way to undertake the object modelling
is the use of strings over a given alphabet.

Once the grammatical models are obtained, several methods exist to work
out a distance measure between the samples to be classified and the models,
allowing the classification of noisy samples.

There also exist several works which establish an edit distance between trees,
or between a tree and a tree automaton; however, there are few applications of
these works to real tasks.

In this work, we consider two tree language inference algorithms to
obtain tree automata as class models in a handwritten digit recognition task.



136 D. Lopez and I. Pinaga 



Both algorithms deal with a kind of tree without labels in its internal nodes,
named a skeleton, so we will transform trees into skeletons, making from now on
no distinction between them.

Once the automata are obtained, the samples are classified considering the
distance obtained by applying a previously proposed error-correcting analysis
algorithm. The experiments carried out show the good performance of the
algorithm.

3.1 Feature Extraction 

All the binary images of handwritten digit samples used in this work¹, both
in the training and test phases, were previously thinned using the Arcelli and
Sanniti algorithm. These simplified images were the starting point of the tree
representation procedure, which may be summarized as follows:

1. The upper leftmost pixel of the thinned image is assigned to the root node
of the tree, and “@” is assigned as its label.

2. Each node of the tree has as many branches as the pixel assigned to the
node has neighbours.

3. Each branch stretches until one of the following conditions is fulfilled:

   - The branch reaches the length established in the parameter window.
   - There are no more neighbours to the pixel (a final pixel is found).
   - A pixel with more than one neighbour is found (an intersection pixel).

   Once the end of the branch is found, its final pixel is assigned to a new
   node. The node label comes from the scheme explained in figure 1.

4. For each node with neighbours, go to step 2.

The trees which are obtained by this procedure have labels in their internal
nodes. To transform these trees into skeletons without loss of information, the
operator “Sk” was applied:

\[ Sk(a) = a : \forall a \in V_0 \]
\[ Sk(a(t_1, \ldots, t_n)) = \sigma(a, Sk(t_1), \ldots, Sk(t_n)) : \forall t_1, \ldots, t_n \in V^T,\ a \in V_n,\ \sigma \notin V \]

One original sample, the image after the application of the thinning algorithm,
and the tree representation are shown in figure 2.

¹ From the data set “NIST SPECIAL DATABASE 3, NIST Binary Images of
Handwritten Segmented Characters”.

Fig. 1. These equations divide the 2-D space into eight regions, as shown on the left;
let the northern one be the region with label “a”, and clockwise, let the rest be labelled
consecutively. When the label of a segment has to be obtained, the starting pixel shall
be placed at the origin of the axes; the label is assigned to the final pixel depending
on its position relative to the starting one.
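The labelling scheme of Fig. 1 can be sketched in code. The equations that delimit the eight regions are not recoverable from the text, so this sketch assumes eight equal 45-degree sectors centred on the compass directions; the function name and the (row, column) pixel convention, with rows growing downwards, are also assumptions.

```python
import math

LABELS = "abcdefgh"  # 'a' is the northern region, the rest follow clockwise

def segment_label(start, end):
    """Label of the segment going from pixel `start` to pixel `end`.
    Pixels are (row, col) pairs with rows growing downwards."""
    d_row = end[0] - start[0]
    d_col = end[1] - start[1]
    if d_row == 0 and d_col == 0:
        raise ValueError("start and end pixels coincide")
    # Angle measured clockwise from north, in degrees within [0, 360).
    angle = math.degrees(math.atan2(d_col, -d_row)) % 360.0
    # Shift by half a sector so that each sector is centred on its direction.
    sector = int(((angle + 22.5) % 360.0) // 45.0)
    return LABELS[sector]
```

Under these assumptions a segment pointing straight up is labelled "a", one pointing right "c", one pointing down "e" and one pointing left "g".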




Fig. 2. A digit sample, the thinned sample obtained, and the tree representation when 
the window is equal to 8 are shown. 
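The skeleton operator Sk used in the feature extraction step moves each internal label down to a new first child and replaces the internal node by a symbol outside the alphabet, so no information is lost. A minimal sketch, assuming trees encoded as nested tuples and using "#" as a hypothetical stand-in for the extra symbol:

```python
SIGMA = "#"  # hypothetical internal-node symbol, assumed not to belong to V

def to_skeleton(t):
    """Apply Sk to a tree given as (label, child_1, ..., child_n).
    Leaves are unchanged; an internal node a(t_1, ..., t_n) becomes
    SIGMA(a, Sk(t_1), ..., Sk(t_n))."""
    if len(t) == 1:
        return t
    return (SIGMA, (t[0],)) + tuple(to_skeleton(child) for child in t[1:])

# Hypothetical small tree in the style of the digit representations.
tree = ("@", ("c", ("e",)), ("g",))
skeleton = to_skeleton(tree)
```

Here `skeleton` equals `("#", ("@",), ("#", ("c",), ("e",)), ("g",))`: every internal node carries the same symbol and the original labels survive as leaves.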



3.2 Error Correcting Analysis 

Due to the fact that almost all the samples to be classified in real tasks are 
distorted or have some kind of noise, the development of error models is crucial 
in syntactic pattern recognition applications. 

The algorithm we test in this work explores each tree in postorder,
calculating the cost of reducing each subtree to each one of the states of the
automaton. Briefly, the method works out the distance of every subtree to every
state of the automaton; to do that, the algorithm compares the successors of a
node with the different ways the automaton can produce each state.

Although the insertion, deletion and substitution costs vary depending on
the node being analyzed, a dynamic programming scheme allows the algorithm
to obtain the distance with polynomial time complexity with respect to the size
of the tree and the automaton. The authors prove that the distance obtained is
the minimum one according to the edit operations proposed.

3.3 Experimentation and Results 

In order to test the behaviour of the edit distance algorithm, two tree language
inference algorithms were used, both of them working with positive data.

The first of them obtains a k-testable in the strict sense tree automaton
(k-TSS tree automaton), which is basically defined by a set of substructures of
size k that are allowed to appear in the members of the tree language. If the
target language is not a k-testable language, the algorithm obtains the smallest
k-TSS tree language that contains the training set. By varying the value of k, a
hierarchy of languages can be obtained, in which the higher the value of
k, the bigger the class; this is the reason why several values of the parameter
k have been tested in the experimentation.

The second algorithm introduces a context-free normal form named
reversible context-free grammars, and gives an algorithm to learn context-free
grammars in this normal form from positive samples and their structural
information, that is, the derivation skeletons of the samples. Since every
context-free grammar has an equivalent reversible one, the algorithm learns the
context-free class from positive structural information. Because this structural
information (sets of trees) can be seen as members of a tree language, the
adaptation to this kind of multidimensional language is straightforward.



window   in-order string representation
16       (@(d(f(e(g))(g)))(f(e(f))))
8        (@(c(e(f(f(e(f(g)))))))(g(e(d(c)(f(e(d)))))))
4        (@(c(c(e(f(f(g(f)))))))(g(f(e(e(d(c(e(e(e(g(g))))))(g(f(e(e(d)))))))))))

Fig. 3. Differences between trees obtained with different window sizes. All the trees
represent the same digit (the one in figure 2). Notice that the bigger the window, the
more compact the representation.



The classification was carried out with and without considering the editing
measure. Since the parameter window modifies the tree representation
(as shown in figure 3), it was also considered in the experiments.

Moreover, a classification scheme by voting was implemented. In this scheme,
with Dk and Dr being the lowest distances among the k-testable and reversible
models respectively, and Ck and Cr the sets of models with distance Dk and Dr,
the classification was completed following Algorithm 1.1.

Several sizes of the training set were tested (100, 200, 300 and 400 samples),
testing the models with the same 3000-sample set. When the k-testable algorithm
was used, values of k between 2 and 6 were considered. A brief comparison of
the results obtained is shown in figure 4. An extended version of these results
has been reported separately.

Algorithm 1.1 Classification by voting scheme.

Input:  Ck, Cr, Dk, Dr
        t, sample to be classified
Output: C, best class to classify t

Method:
    if |Ck| == |Cr| == 1 and Dk == Dr
        C = Ck
    fi
    if |Ck| == |Cr| == 1 and Di < Dj : i,j ∈ {k,r}
        C = Ci
    fi
    if |Ck| > 1 or |Cr| > 1
        if |Ck ∩ Cr| == 1
            C = Ck ∩ Cr
        else
            if Di < Dj and |Ci| == 1 : i,j ∈ {k,r}
                C = Ci
            else t could not be classified
            fi
        fi
    fi
EndMethod
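The voting scheme of Algorithm 1.1 can be written as a short function. This is a sketch of our reading of the listing, with a hypothetical function name; candidate classes are held in Python sets, both sets are assumed non-empty, and `None` stands for "t could not be classified".

```python
def vote(Ck, Cr, Dk, Dr):
    """Ck, Cr: sets of classes at minimum distance for the k-testable and
    reversible models; Dk, Dr: those minimum distances.
    Returns the chosen class, or None when the sample stays unclassified."""
    if len(Ck) == 1 and len(Cr) == 1:
        # Single candidate on both sides: the closer one wins, and a tie
        # is resolved in favour of Ck, as in the first branch of the listing.
        return next(iter(Ck)) if Dk <= Dr else next(iter(Cr))
    # At least one side is ambiguous: try the intersection first.
    common = Ck & Cr
    if len(common) == 1:
        return next(iter(common))
    # Otherwise accept an unambiguous side only if it is strictly closer.
    if Dk < Dr and len(Ck) == 1:
        return next(iter(Ck))
    if Dr < Dk and len(Cr) == 1:
        return next(iter(Cr))
    return None
```

For example, `vote({3, 8}, {8, 1}, 1.0, 1.0)` returns 8 because it is the only class proposed by both model families, while `vote({3, 8}, {1, 2}, 1.0, 1.0)` returns `None`.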





training set   algorithm      window size   % (without / with error model)
200            3-TSS          4             15.17 / 43.50
200            4-TSS & rev    4              9.57 / 80.10
300            3-TSS          6             23.53 / 49.43
300            4-TSS & rev    6             10.23 / 83.83
400            3-TSS          4             21.30 / 36.77
400            4-TSS & rev    4             17.74 / 79.70

Fig. 4. Summary of results. From left to right: size of the training set, algorithm used
(k-testable or voting), size of the window and classification rate (without and with
error model).



In every case, the voting strategy gave the best results, being able to
classify correctly up to 30% more samples than the models obtained directly by
inference.

As expected, the experimentation showed that 100 samples were not enough 
to model all the variance in the classes. Using 400 samples, an early over- 
generalization was observed together with very similar distances between classes 
when the error model is taken into account, increasing in that way the ambiguity 
rate. 

The best results without an error model were obtained with the 3-TSS, a
window size of 6 and a training set of 300 samples (23.53%). All the training
set sizes tested gave reversible models with poor generalization, and therefore
bad results. Using an error model, the voting scheme, a training set of 300
samples and a window size of 6, a rate of 83.83% was reached.

4 Conclusions and Future Work 

In this work a syntactic pattern recognition task is carried out.
Multidimensional (tree) primitives were used to model the objects in a
handwritten digit recognition task. These primitives give more representation
power than string primitives. Two tree language inference algorithms were used
to build the models that, together with a mixed strategy, were considered to
carry out the classification. An editing distance algorithm between trees and
tree automata was used to obtain an error model.



Fig. 5. Examples of noisy digits. From left to right and downwards: zero, two, four,
four, five, six, eight, eight, nine, and nine.



The results obtained prove the validity of the approach. Nonetheless, several
of the thinned samples showed noise (figure 5), which opens the possibility of
improving the results by modifying the tree extraction algorithm
to get rid of this noise and to allow the representation of new features
(for instance, loops).

Furthermore, considering the whole scheme of this work (tree representation,
inference algorithms and error model), the editing distance algorithm gives a
somewhat inaccurate measure, which causes ambiguity. An alternative tree
representation (perhaps qtrees), a previous weight-learning step, or
the use of probabilities as an addition to the scheme might help to improve the
performance.

Abstract. The problem of continuous handwritten text (CHT) recognition
using standard continuous speech recognition technology is considered.
The main advantages of this approach are a) system development is
completely based on well understood training techniques and b) no
segmentation of sentence or line images into characters or words is required,
neither in the training nor in the recognition phases. Many recent papers
address this problem in a similar way. Our work aims at contributing to
this trend in two main aspects: i) We focus on the recognition of
individual, isolated characters using the very same technology as for CHT
recognition in order to tune essential representation parameters. The
results are themselves interesting since they are comparable with
state-of-the-art results on the same standard OCR database. And ii) all the
work (except for the image processing and feature extraction steps) is
strictly based on a well known and widely available standard toolkit for
continuous speech recognition.

Keywords: Off-Line Continuous Handwriting Text Recognition, Feature
Extraction, Language Modelling, Hidden Markov Models, Bank
Check Legal Amount Recognition



1 Introduction 

The recognition of off-line, continuously handwritten text is proving to be a
quite challenging pattern recognition task. Although text is basically
composed of characters, most traditional approaches to optical character
recognition (OCR) generally fail in this task because of the extreme difficulty
of segmenting continuously written text into characters. In fact, not even the
segmentation into words can be reliably accomplished using standard techniques
in most cases. Nevertheless, humans do accurately perform both segmentation and
recognition in a seemingly effortless manner. Accuracy is achieved by
“delaying” recognition until the highest perception level: only after having
understood a written (part of a) message are humans capable of “recognizing”
the constituent words, the corresponding characters and the underlying
segmentations. Clearly, this striking human ability comes from a tight
cooperation of morphologic, lexical, syntactic and semantic-pragmatic knowledge
to accomplish the task.

This very same situation appears in the field of continuous speech recognition
(CSR). In this field, successful techniques already exist which are actually
based on approaching the abovementioned tight cooperation of knowledge sources.
After many decades of research in this field, commonly accepted adequate
solutions come from three basic principles: i) adopt simple, homogeneous and
easily understandable models for all the knowledge sources, ii) formulate the
recognition process as an optimal search through an adequate structure based on
these models, and iii) use adequate techniques to learn the different models
from training data of each considered task. All these principles are properly
fulfilled by the use of finite-state (FS) modeling techniques such as hidden
Markov models (HMM) and stochastic FS (SFS) grammars or automata [2].

In this paper, we address the problem of continuous handwritten text (CHT)
recognition using standard CSR technology. Many recent papers address this
problem in a similar way. Our work aims at contributing to this trend in two
main aspects: i) As an important part of the work, we
focus on the recognition of individual, isolated characters using the very same FS
technology as for CHT recognition. This study parallels similar work on phonetic
decoding that has proved quite helpful to optimize CSR systems. The results are
themselves interesting since they are comparable with state-of-the-art results on
the same standard OCR database. And ii) all the work (except for the image
processing and feature extraction steps) is strictly based on the well known and
widely available standard HTK toolkit for CSR [7]. As an application example,
in this work we focus on the recognition of legal amounts in bank checks.



2 Feature Extraction 

Following current trends in HMM-based off-line handwriting text recognition,
the image of a text sentence or line is represented as a sequence of feature
vectors. The height of the image is first normalized to a constant value so as
to minimize the dependence on writing style size. Then, the image is divided
into a grid of square cells whose size is a small fraction of the image height
(tested values are 1/16, 1/20 and 1/24). Each cell is characterized by the
following simple and script-independent features: normalized grey level,
horizontal grey-level derivative and vertical grey-level derivative.

To obtain smoothed values of these features, feature extraction is not
restricted to the cell under analysis but extended to a 5 x 5 window centered at
the current cell. To compute the normalized grey level, the analysis window is
smoothed by convolution with a 2-d Gaussian filter. On the other hand, the
horizontal grey-level derivative is computed as the slope of the line which best
fits the horizontal function of column-averaged grey levels. The fitting criterion
is the sum of squared errors weighted in accordance with a 1-d Gaussian filter
which enhances the role of central pixels in each analysis window. The vertical
grey-level derivative is computed in a similar way.
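The horizontal derivative described above amounts to a weighted least-squares line fit. The sketch below illustrates that computation in plain Python; the function names, the value of `sigma` and the representation of the analysis window as a list of pixel rows are assumptions made for the example, not details given in the text.

```python
import math

def gaussian_weights(n, sigma):
    """1-d Gaussian weights centred on the middle of n positions."""
    centre = (n - 1) / 2.0
    return [math.exp(-((i - centre) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]

def weighted_slope(values, weights):
    """Slope of the line minimising the weighted sum of squared errors
    through the points (i, values[i])."""
    total = sum(weights)
    mean_x = sum(w * i for i, w in enumerate(weights)) / total
    mean_y = sum(w * v for w, v in zip(weights, values)) / total
    s_xx = sum(w * (i - mean_x) ** 2 for i, w in enumerate(weights))
    s_xy = sum(w * (i - mean_x) * (v - mean_y)
               for i, (w, v) in enumerate(zip(weights, values)))
    return s_xy / s_xx

def horizontal_derivative(window, sigma=1.0):
    """window: list of rows of grey levels covering the analysis window.
    Averages each column, then fits a Gaussian-weighted line across columns."""
    n_cols = len(window[0])
    column_means = [sum(row[j] for row in window) / len(window)
                    for j in range(n_cols)]
    return weighted_slope(column_means, gaussian_weights(n_cols, sigma))
```

The vertical derivative would follow by applying the same fit to row averages, for instance by transposing the window before the call.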

Columns of cells or frames are processed from left to right and a feature
vector is constructed for each frame by concatenating the features computed in
its constituent cells (see Fig. 1).




Fig. 1. Feature extraction for the Spanish sentence “mil” (one thousand). Top:
original image; bottom: two representations as sequences of k-dim vectors; left k = 16 x 3,
right k = 20 x 3. Each (column) vector is divided into three blocks (from top to bottom):
normalized grey levels, horizontal derivatives and vertical derivatives.



3 Character, Word, and Sentence Modelling 

Individual characters are modelled by continuous density left-to-right hidden
Markov models (HMM), similar to those used in CSR. Fig. 2 shows an example
of the structure of one of these models. Basically, each character HMM
is an SFS device that has to model the succession, along the horizontal axis,
of (vertical) feature vectors which are extracted from instances of this
character. It is assumed that each HMM state generates feature vectors following
an adequate parametric probabilistic law; typically, a mixture of Gaussian
densities. The required number of densities in the mixture depends, along with
many other factors, on the “vertical variability” typically associated with each
state. This number needs to be empirically tuned in each task. On the other hand,
the number of states that is adequate to model a certain character or character
set depends on the underlying “horizontal variability”. For instance, to ideally
model a capital “E” character, only two states might be enough (one to model
the vertical bar and the other for the three horizontal strokes), while three
states may be more adequate to model a capital “H” (one for the left vertical
bar, another for the central horizontal stroke and the last one for the right
vertical bar). Note that the possible or optional blank space that may appear
between characters should also be modelled by each character HMM. The most
appropriate number of states for a given task also depends on the amount of
training data which is available to train model parameters. So, the exact number
of states to be adopted needs some empirical tuning in each practical situation.
Once an HMM “topology” (number of states and structure) has been adopted, the
model parameters can be easily trained from continuously written text (without
any kind of segmentation) accompanied by the transcription of this text into the
corresponding sequence of characters (cf. Sect. 5.2). This training process
is carried out using a well known instance of the EM algorithm called
backward-forward or Baum-Welch re-estimation. Obviously, the very same technique
can also be used if isolated versions of the individual characters are available
(cf. Sect. 5.1), as in standard OCR.




Fig. 2. Structure of a Character Left-to-Right Hidden Markov Model aimed at
modelling instances of the character “n”.



Words are obviously formed by concatenation of characters. In our FS modeling
framework, for each word, an SFS automaton is used to represent the possible
concatenations of individual characters to compose this word. This automaton
also takes into account optional capitalizations, as well as the blank space usually
left at the end of each word (as previously discussed, the possible inter-character
blank space is modeled at the character level HMM). An example of an automaton
for the Spanish word “mil” is shown in Fig. 3.

Sentences are formed by the concatenation of words. In contrast with CSR,
blank space often (but not always) appears between words. As previously
discussed, this optional blank space is modeled at the lexical level. The
concatenation of words is modeled by a (FS) language model. In our bank check
example application, it consists of an FS grammar which recognizes all the
text-written Spanish numbers from 0 to 10^^ - 1. The terminal symbols (or
lexicon) of this grammar are Spanish words used to write numbers, such as
“uno”, “dos”, “diez”, “sesenta”, “cien”, “mil”, “millon”, etc. (one, two, ten,
sixty, hundred, thousand, million, etc). Moreover, this model is built as a
sequential FS transducer which also provides an output for each input sequence
of words. The output is an arithmetic expression whose value is that of the
number given through the input text; for example, from the Spanish text
“doscientos sesenta y dos mil veinte” (two hundred sixty two thousand and
twenty) the obtained output is: “+(200 + 60 + 2) * 1000 + 20”. From this
expression the target (decimal) number (262,020) can be easily obtained. A small
fragment of this transducer is shown in Fig. 4.

Fig. 3. Automaton for the lexicon entry “mil”. The symbol “@” is for a blank segment.

The aim of this setting is similar to that of earlier work. However, the
approach followed here is strictly based on FS technology and is therefore much
simpler. In fact, in our system the required decimal digit string is just
obtained by directly piping the output of the CHT recognizer to the standard
Unix tool “bc”. Furthermore, (in most languages) this last evaluation step can
be avoided altogether by the use of a slightly more powerful kind of FS device
known as a “subsequential” transducer. These FS devices, which are automatically
learnable from training data, allow direct translation of text-represented
numbers into decimal form. In this way, the use of syntax-directed transducers
as in [8] (which are not FS and would therefore break our homogeneity
assumption) is no longer needed.
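The final evaluation step, which the text delegates to the Unix calculator, can be illustrated with a few lines of Python. This is a stand-in written for this example and not part of the described system: the function name is hypothetical, and only integer constants, addition, multiplication and a leading unary plus are accepted, which covers expressions of the form shown above.

```python
import ast
import operator

# Binary operators allowed in the transducer output (an assumption based
# on the single example expression given in the text).
_BINARY = {ast.Add: operator.add, ast.Mult: operator.mul}

def evaluate(expression):
    """Evaluate an arithmetic expression such as '+(200 + 60 + 2) * 1000 + 20'
    without resorting to eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.UAdd):
            return walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported token in expression")
    return walk(ast.parse(expression.strip(), mode="eval"))

amount = evaluate("+(200 + 60 + 2) * 1000 + 20")
```

For the example sentence of the text, `amount` is 262020, the decimal value of "doscientos sesenta y dos mil veinte".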

Some features of our numbers transducer are: 51 input words, 187 output
tokens, 32 states, 660 transitions; (test-set) perplexity: 6.2.




Fig. 4. A piece of the numbers transducer. Solid-line edges correspond to a path that
accepts “doscientos sesenta y dos mil veinte” (two hundred sixty two thousand and
twenty), yielding “+(200+60+2)*1000+20”.

4 Knowledge Integration: Recognition as a Best Hypothesis Search

Once all the character, word and language models are available, recognition of
new test sentences can be performed. Thanks to the homogeneous FS nature of
all these models, they can be easily integrated into a single global (huge) FS model
that accepts sequences of raw feature vectors and outputs strings of recognized
words (and, in our application, also the corresponding arithmetic expressions).
Fig. 5 illustrates this integration.

Given an input sequence of feature vectors, the best output hypothesis is one
which corresponds to a series of states of the integrated model that, with highest
probability, produces the input feature-vector sequence. This global search
process is very efficiently carried out by the well known (beam-search-accelerated)
Viterbi algorithm. This technique allows integration to be performed “on the
fly” during the decoding process. In this way, only the memory strictly required
for the search is actually allocated.
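The search for the most probable state sequence can be sketched with a plain Viterbi recursion in the log domain. This is a textbook version under simplifying assumptions: it works on an explicit, already integrated model given as dense tables, it omits the beam pruning and on-the-fly integration mentioned above, and all names are hypothetical.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """log_init[s]: log-probability of starting in state s.
    log_trans[s][s2]: log-probability of moving from s to s2.
    log_emit[t][s]: log-likelihood of frame t in state s.
    Returns (best log-probability, best state sequence)."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s2 in range(n_states):
            # Best predecessor of s2 at this frame.
            best = max(range(n_states), key=lambda s: score[s] + log_trans[s][s2])
            new_score.append(score[best] + log_trans[best][s2] + log_emit[t][s2])
            pointers.append(best)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

Forbidden transitions, such as going backwards in a left-to-right model, can be encoded as `float("-inf")` entries in `log_trans`.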

5 Experiments 

Experiments have been carried out to tune certain constants at each knowledge 
level, to train HMM parameters and to test the recognition performance of the 
resulting systems. 



5.1 Isolated Character Recognition: Optimizing Feature Extraction 
Parameters 

Although accurate recognition requires knowledge at higher perception levels, 
isolated character recognition serves as a good basis to make adequate decisi- 
ons on feature extraction parameters. To this end, rather simple features were 
empirically compared in order to assess their discriminating power in the clas- 
sification of the 18 lowercase letters appearing in our bank check application 
(“a,c,d,e,h,i,l,m,n,o,q,r,s,t,u,v,y,z”). 



Fig. 5. A small piece of an integrated FS model, using three-state character HMMs.
The part shown stands for the sentences “mil”, “mil uno” and “mil dos” (1,000; 1,001;
1,002). For the sake of clarity, only un-capitalized word models are shown and output
arithmetic-expression tokens are omitted.

Letter samples were extracted from the widely used NIST Special Database
3. This database, published on CDROM [10], includes 45,313 binary images
of (segmented) lowercase letters extracted from forms written by 2,100 writers.
For our purposes, a moderate-size training set was built out of samples of our
18 lowercase letters from the first 200 writers (overall 3,034 samples, one sample
of each letter per writer, when available). Similarly, a test set of 750 samples
was extracted from 50 independent writers (writers #1,051 through #1,100).
Following the procedure described in Section 2, six different image
representations were considered by combining two sets of features (grey-level
alone or grey-level plus derivatives) and three vertical resolutions (1/16, 1/20
or 1/24).

Left-to-right four-state continuous-density HMMs were used to model each 
character. Each state was assigned a mixture of k Gaussian densities with diagonal 
covariance matrices. Values of k in {1,2,4,8,16,32} were tried in the 
experiments. HMMs were trained and tested using the HTK toolkit. More 
precisely, three cycles of Baum-Welch re-estimation were run to train the model 
parameters, while the Viterbi recogniser was used to classify each test sample in 
accordance with the most likely HMM. Results are shown in Table 1. 
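
To make the state emission model concrete, the sketch below evaluates the log-likelihood of one feature vector under a mixture of k Gaussian densities with diagonal covariance matrices. It is a plain-Python illustration under our own naming (`diag_gmm_loglik` and its arguments are hypothetical); HTK performs this computation internally and its implementation is not shown here.

```python
import math

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian mixture.

    weights:   k mixture weights summing to 1.
    means:     k mean vectors, each with len(x) components.
    variances: k vectors of per-dimension variances (the covariance diagonals).
    """
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            # A diagonal covariance factorises into independent 1-D Gaussians.
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        comps.append(ll)
    # Log-sum-exp over the components, shifted by the maximum for stability.
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))
```

With diagonal covariances the cost per frame grows linearly with the feature dimension, which is one reason this form is common in HMM toolkits.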



Table 1. Classification error rate (in %) for isolated character recognition, using 
four-state continuous-density HMMs, different feature sets (with/without derivatives) and 
varying numbers of Gaussian densities per model state. 

Set of features            Vertical      Number of Gaussian densities per state
                           resolution       1      2      4      8     16     32

Grey Level                 1/16          33.8   28.2   24.0   23.4   21.0   20.9
                           1/20          31.3   26.8   20.7   19.0   17.9   16.6
                           1/24          31.0   26.8   23.1   19.5   17.2   14.1

Grey Level & Derivatives   1/16          21.9   16.7   14.6   12.3   10.9   10.5
                           1/20          20.1   17.9   13.4   11.1    9.5   11.0
                           1/24          21.2   18.4   14.3   11.8    9.2   10.1

From these results, it is clear that using grey-level derivatives significantly 
improves recognition accuracy. Vertical resolution, however, does not seem to 
be a key factor: a vertical resolution of 1/20 might be good enough for the 
experiments with continuous text. On the other hand, it is worth noting that 
our best error rates (9.2% and 9.5%) are similar to the best figures reported (at 
zero-rejection rate) after the First Census OCR Conference (11% for the best 
system and 8.6% for two-pass human classification) [3, p. 26]. Although we do 
not face classification of the whole set of lowercase letters, these results encourage us to 
continue exploring the application of FS technology to OCR. 

5.2 Recognition of Sentences Composed of Artificially Concatenated 
Characters: Assessing the Power of Model Integration 

The data for the second series of experiments consisted of 500 images of random 
sentences composed by the concatenation of the appropriate randomly selected 

Fig. 6. Examples of sentences produced by concatenating randomly selected hand- 
written characters from the NIST database: 74,000,000; 36,080; 1,700,022,027. 



handwritten isolated characters. Overall, these images contained 9,504 charac- 
ters, which were drawn from the same corpus as that of the previous experiment. 
The sentences corresponded to simulated legal amount numbers in bank checks 
(see examples in Fig. 6). From this data, 313 sentences (609 words, 5,900 characters) 
were used to train the character models and 187 sentences (975 words, 
3,604 characters) to test the performance of the recognition system. No data 
from “training writers” were used in the composition of test sentences. 

The main difference between this setting and that of the previous experiment 
is that now training (and testing) is carried out using long images of continuous 
text, without any kind of segmentation or information about the actual position 
of the characters in each sentence. The aim of this experiment was to assess 
the power of integrating morphologic, lexical and language models to improve 
recognition performance. 

Automatically determining a different topology and/or number of states 
which is best suited to model each particular character proves to be a non- 
trivial problem. Therefore, in this work, identical topology was adopted for all 
the characters. Left-to-right continuous-density HMMs of N states and k Gaussian 
densities per state ($N \in \{4,6,8\}$, $k \in \{1,2,4,8,16,32,64\}$) were used for 
character modeling, plus a special model for the blank character (“@” in the 
figures). In this case, the training procedure was the usual one for acoustic modeling 
of phone units in continuous speech: character-level HMMs were trained 
through four iterations of the Baum-Welch algorithm. This process was initialized 
by a linear segmentation of each training image into a number of equal-length 
segments according to the number of characters in the orthographic transcription 
of the sentence. As in the previous experiments, these models were trained using 
the HTK toolkit. For each test input sentence, the Viterbi decoding algorithm 
was performed on the FS network which integrates character, lexicon and lan- 
guage models. 
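
The linear segmentation used to initialise training can be sketched as follows. This is only an illustration of the idea of cutting a frame sequence into equal-length segments, one per transcribed character; the function name and the half-open (start, end) convention are assumptions of this example.

```python
def linear_segmentation(num_frames, num_chars):
    """Split frames 0..num_frames-1 into num_chars contiguous segments.

    Returns a list of half-open (start, end) frame ranges whose lengths
    differ by at most one frame.
    """
    if num_chars <= 0 or num_frames < num_chars:
        raise ValueError("need at least one frame per character")
    # Integer arithmetic keeps the boundaries exact and monotonic.
    bounds = [(i * num_frames) // num_chars for i in range(num_chars + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_chars)]
```

For example, a 10-frame image transcribed with 3 characters yields the ranges (0, 3), (3, 6) and (6, 10); Baum-Welch re-estimation then refines these crude initial boundaries implicitly.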

Test-set recognition word error rates (WER) are presented in Table 2. Best 
results were achieved using 4-state character HMMs with 8 and 16 Gaussian 
densities per state. In this case, a WER of 3.0% was obtained. If compared with 
the isolated character error rate (Table 1), these results clearly show the power of 
model integration. The corresponding digit error rate, obtained by evaluating the 
arithmetic expression obtained as the translation of each recognized sentence, 
was 2.8%, with 1.8% substitution errors, 0.9% digit deletions and 0.1% insertions. 
From the 187 test sentences, 14 (7.5%) yielded decimal numbers with one (or 
more) digit(s) in error. Digit error rates are the relevant results if the output 
of legal amount recognition is to be validated with the help of the OCR results 
obtained from the corresponding courtesy amount in the same bank check. 
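
Word and digit error rates of this kind are conventionally obtained from a minimum edit-distance alignment between the reference and the recognized sequence, which also yields the split into substitutions, deletions and insertions. The sketch below illustrates that convention; the paper does not describe its scoring tool, so the function names and tie-breaking are assumptions of this example.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] against hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)      # only deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, k = dp[i - 1][j - 1]
            diag = (c, s, d, k) if ref[i - 1] == hyp[j - 1] else (c + 1, s + 1, d, k)
            c, s, d, k = dp[i - 1][j]
            dele = (c + 1, s, d + 1, k)
            c, s, d, k = dp[i][j - 1]
            ins = (c + 1, s, d, k + 1)
            dp[i][j] = min(diag, dele, ins, key=lambda t: t[0])
    return dp[n][m][1:]

def error_rate(refs, hyps):
    """Error rate in %: total edit errors divided by total reference length."""
    errors = sum(sum(align_counts(r, h)) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total
```

Applied to word sequences this gives a WER; applied to the digit strings obtained from the arithmetic expressions it gives a digit error rate.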

Table 2. Test-set recognition word error rates (in %) for artificially concatenated 
characters. Results for different numbers of states per model and Gaussian densities 
per state are reported. 

States    Gaussian densities per state
             1      2      4      8     16     32     64

4          8.5    5.3    3.3    3.0    3.0    3.3    7.6
6          6.1    3.8    3.1    3.5    3.3    3.9   10.2
8         14.4   10.6    9.1    7.8    7.0   10.1   17.2

5.3 Recognition of Real Continuous Text Sentences 

The corpus for this experiment was composed of 485 real images of handwritten 
Spanish legal amounts (2,127 words, 16,039 characters), written by 29 
different writers (see Fig. 7). 298 randomly selected sentences from 18 writers 
were used for training and 187 from the rest of the writers were left for testing. 



Fig. 7. Examples of real continuous text sentences: 1,102; 38,000,024; 16,400,026. 



The training and testing procedures were the same as those described in 
Section 5.2. Test-set recognition word error rates (WER) are presented in Table 3. 
Graphic results for the best number of states (5) and the best number of 
densities per state (16) are also shown in Fig. 8. For the best setting, a WER 
of 18.0% and a digit error rate (DER) of 13.2% were obtained. These good results 
support the adequacy of the proposed technology for continuous handwritten text 
recognition. 

6 Conclusion 

The problem of CHT recognition can be adequately addressed using standard 
CSR technology on images that are represented by sequences of fairly simple 
feature vectors. 

An image is divided into a sequence of vertical sets of cells. Each cell is 
represented by the normalized grey level in its vicinity and the corresponding 
Table 3. Test-set recognition word error rates (in %) for continuous handwritten 
sentences of legal amount numbers in bank checks. Results for different numbers of 
states per model and Gaussian densities per state are reported. 

Number of Gaussians    Number of states per model
per state                 3      4      5      6      7

 1                     77.4   61.1   55.5   52.6   52.1
 2                     69.2   48.6   39.8   35.7   36.6
 4                     55.3   37.2   27.0   25.6   28.5
 8                     44.3   30.5   21.3   19.3   25.9
16                     38.0   26.0   18.0   18.6   24.5
32                     33.9   24.2   18.4   20.1   25.3
64                     36.3   26.5   23.1   25.5   33.9


Fig. 8. Test-set recognition word (and digit) error rates (in %) for the continuous 
handwritten sentences of legal amount numbers in bank checks. Results for different 
number of states per model and 16 Gaussian densities per state (left) and results for 
different number of Gaussian densities per state and 5 states per model (right) are 
shown. 



horizontal and vertical grey-level derivatives. Using these vector sequences, character 
hidden Markov models can be trained by the well-known Baum-Welch 
re-estimation algorithm. New handwritten text can then be recognized through 
the standard Viterbi decoding algorithm on a (virtually) integrated FS network 
composed of character, lexicon and syntactic FS models. 

This methodology has been tested on the recognition of isolated characters 
(OCR) with quite competitive results. Also, experiments on the recognition of 
legal amounts in bank checks have been performed, with very promising results. 
These results, however, should only be considered preliminary, since they have 
been obtained without the help of simple normalisation preprocessing procedures 
that are quite standard in this task. Work is currently under way to include two 
of these procedures, namely, slant normalisation and dynamic baseline detection, 
with well known potential for significant performance improvements. 
Abstract. This paper describes the structural classification method 
used in a strategy for retrospective conversion of documents. This strategy 
consists of a cycle in which document analysis and document understanding 
interact. This cycle is initialized by the extraction of the 
outline of the layout and logical structures of the document. Then, each 
iteration of the cycle consists of the detection and processing of 
inconsistencies in the document modeling. The cycle ends when no more 
inconsistencies occur. 

A structural representation is used to describe documents. This repre- 
sentation is detailed. 

Retrospective conversion consists in identifying each entity of the do- 
cument and its structures as well. The structural classification method 
based on graph comparison is used at several levels of this process. Graph 
comparison is also used in the learning of generic entities. 

Keywords: retrospective conversion, document structure. 



1 Introduction 

This paper describes a strategy used for retrospective conversion of documents. 
Retrospective conversion of documents consists in constructing a document 
representation from the document image. The obtained representation can easily 
be converted to an electronic format. Retrospective conversion is useful because it 
allows paper documents to benefit from the advantages of electronic documents, which 
can be edited, distributed, indexed and archived. 

Retrospective conversion of documents often consists of two major steps 
(cf. Fig. 1): document analysis and document understanding. Document analysis 
consists in extracting the layout structure of a document from its image. Docu- 
ment understanding aims at building the logical structure of the document. 

In this paper we propose a strategy for retrospective conversion of documents 
based on structural classification. Section 2 details the document representation. 
Fig. 1. Retrospective Conversion of a Document 



The algorithm used for structural classification is presented in Section 3. Section 4 
details the different steps of document understanding. 



2 Document Representation 

A document can be described by two structures: the layout structure and the 
logical structure. The layout structure hierarchically models the visual aspect 
of documents. It is obtained by extracting and classifying graphical elements 
of the document image. These graphical elements are represented by so-called 
layout objects. The logical structure represents the document organization on 
the basis of the meaning of the content. The logical structure describes the way 
a document can be divided into title, sections, subsections, paragraphs, etc. Each 
logical element is described by a logical object. 

Documents can be grouped into classes. A document class is a set of documents 
which share a part of their layout structure and logical structure. The part of 
the structure which is shared by all the documents from a class is the generic 
structure. It defines a structure class. Then each document class is represented 
by a generic layout structure and a generic logical structure. 

Objects can also be grouped into classes. A generic object (generic layout 
object or generic logical object) describes an object class and consists of 
the features common to each object of the class. 

In our document representation, layout objects represent graphical elements 
of the document image (a text line, a text block, a text column, an image...) 
and logical objects represent meaningful entities (title, section, subsection...). 
An object can be a basic object or a compound object. Each object has four 
attributes (see Fig. 2): 

— its label; 

— a numerical feature vector; 

— the label of its parent object; 

— its structure; 

The label of the object is the name of the class it belongs to. The object 
classification process consists in determining this attribute. 

The numerical feature vector contains intrinsic information. The feature 
vector of a layout object contains visual indices concerning the graphical entity 
represented (location, dimension, black pixel density). The feature vector of a 
logical object contains formatting information (alignment, style, size). 

A graph $G_{obj}(V_{obj}, E_{obj}, \alpha_{obj}, \beta_{obj})$ represents the structure of each object. If 
the object is a basic object, the graph is empty, but if it is a compound object, 
each node of the graph is labeled by the class of its components. An edge is 
established between two nodes if the components present a neighboring relation. 

The feature vector, the label of the parent object and the structure of the 
object are used in the object classification. This process is detailed in Section 4.2. 




Fig. 2. Structure of an object 



The structures describe the way the objects are organized in the document. 
Two graphs $G_{lay}(V_{lay}, E_{lay}, \alpha_{lay}, \beta_{lay})$ and $G_{log}(V_{log}, E_{log}, \alpha_{log}, \beta_{log})$ represent 
the layout (Fig. 3) and logical (Fig. 4) structures of the document. The nodes 
of these graphs are labeled by the labels of the objects, and edges describe hierarchical 
and neighboring relations. 
Document 

Title 

Principal Title 
Subtitle 
First Part 
Title 
Paragraph 
Second Part 
Title 
Paragraph 
Third Part 
Title 

Paragraph 1 
Paragraph 2 
Notes 
note 1 
note 2 




Fig. 4. Logical structure of a document 
3 Structural Classification 

Section 2 describes the different elements used for document representation. This 
structural representation uses graphs to represent document layout and logical 
structures and object structure. 

The retrospective conversion consists in extracting the different elements of 
the document representation from its image and in determining the class of each 
object and finally the class of the document. Structural classification helps in 
that task. This section details the method used for structural classification. 

Our structural classification is based on the search for a subgraph isomorphism 
between the graph to be identified and graphs representing generic 
entities. For example, if the graph to be identified represents the structure of a 
layout object, it is compared with all graphs representing the structure of generic 
layout objects. If the graph represents the logical structure of the document, 
it is compared with the graphs representing the generic logical structures. 

Each comparison of two graphs $G_1(E_1, V_1, \alpha_1, \beta_1)$ and $G_2(E_2, V_2, \alpha_2, \beta_2)$ produces 
a graph $G_3$, which is constructed as follows. First, the greatest 
matching between equivalent edges from $V_1$ and $V_2$ is searched. Two edges 
are considered equivalent if their labels are equal and if the labels associated with 
their extremities are equal. This produces an initial version of $G_3$. $G_3$ is then 
completed by finding the greatest matching between nodes from $E_1$ and $E_2$ which 
have not been associated during the first step. 

We define a similarity measurement $\delta(G_1, G_2)$ between $G_1$ and $G_2$. Two 
overlapping rates $t_1$ and $t_2$ are determined: $t_1$ is the number of nodes 
of $G_3$ divided by the number of nodes of $G_1$, and $t_2$ is the number of 
nodes of $G_3$ divided by the number of nodes of $G_2$. If one of these rates equals 1, one 
of the graphs is included in the other. In this case, if the other rate is very 
small, then the included graph is very small compared with the other one. If the 
compared graphs are equal, $t_1$ and $t_2$ are both equal to 1. A similarity measurement 
can be established as 



$\delta(G_1, G_2) = \ldots - 1.$ 
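
The two overlapping rates can be illustrated with a short sketch. As a loud simplification, the common graph is approximated here by the multiset intersection of node labels, ignoring edges; the method described above matches equivalent edges first and then the remaining nodes. The function name and the list-of-labels input are assumptions of this example, and the sketch stops at the rates rather than guessing how they are combined.

```python
from collections import Counter

def overlap_rates(labels_1, labels_2):
    """Return (t1, t2) for two graphs given as lists of node labels.

    Assumption: the number of nodes of the common graph G3 is taken to be
    the size of the multiset intersection of the node labels.
    """
    c1, c2 = Counter(labels_1), Counter(labels_2)
    common = sum((c1 & c2).values())   # nodes that can be paired label-for-label
    return common / len(labels_1), common / len(labels_2)
```

With `["title", "paragraph", "paragraph"]` against `["title", "paragraph"]`, the second rate is 1 and the first is 2/3, which corresponds to the case where one graph is included in the other.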



4 Retrospective Conversion 

4.1 Interpretation Cycle 

A complete retrospective conversion of documents has to construct a document 
modeling which represents, at least, the layout structure and the logical structure 
of the document. Our strategy is based on a cycle inspired by Ogier. This 
cycle makes document analysis and document understanding interact. The cycle 
(see Fig. 5) is initialised by a phase which provides primitive versions of layout 
and logical structures. 
Fig. 5. Interpretation cycle 



The outline of the layout structure is obtained by extracting graphical objects 
from the document image. This is performed by a segmentation algorithm 
applied to the document image after low-level processing (deskewing and 
binarisation). Extracted objects are then associated in composite layout objects 
according to size and proximity criteria. Then, they are labelled (text, graphic, 
image...) according to graphic criteria (size, black pixel density...). New com- 
posite objects are then constructed with adjacent objects which are identically 
labelled. Finally, a first version of the layout structure is obtained. 

The structural classification method which compares a specific structure to be 
identified with structures representing document classes gives a first hypothesis 
concerning the document class. Assuming that a document class contains not 
only a generic layout structure but also a generic logical structure, the outline 
of the logical structure is built by instantiating the generic logical structure 
corresponding to this hypothesis. This instantiation is performed by associating 
a logical equivalent to basic layout objects. 

This initialises the interpretation cycle. Each iteration of the cycle consists of 
locating and processing inconsistencies in the document representation. 
At each iteration, the class attributed to each object and to each structure is called into 
question, so objects and structures must be reclassified every time. 

Different levels of consistency are examined. First, we define what we call 
intrinsic inconsistency. It refers to the fact that no generic object contains the 
features observed for the specific object: the object cannot be associated with 
any of the known object classes. Conversely, an object is said to be intrinsically 
consistent if its features can occur in at least one of the known object classes. 

The next consistency level is called contextual neighboring consistency. An 
object is said to be consistent at the neighboring contextual level if there is 
at least one generic object which includes this object and its neighbors as its 
constituents in the observed configuration. 

Hierarchical consistency deals with the fact that an object associated with 
a specific class may or may not be a constituent of an object of another class. An 
object is said to be hierarchically consistent if its class is compatible with the 
class of the hierarchically superior object. 

Finally, we define the abstraction level consistency. It deals with the compatibility 
between the class of a logical object and the class of the corresponding 
layout object. This mapping between layout and logical objects is not always 
possible: a logical object does not always correspond to a single layout object. For 
instance, a paragraph can be split into two text blocks on two columns. However, 
the abstraction level consistency can always be evaluated for structures: the 
results of layout structure and logical structure classification must correspond 
to the same document class. 

4.2 Object Classification 

Object classification aims at attributing each object to a known class, which is 
represented by a generic object. The object to be identified is compared to all 
generic objects. This is performed by making three classifiers cooperate. Each of 
these classifiers gives a list of hypotheses weighted by a similarity measurement. 

A statistical classifier (Nearest Neighbor) uses the distance between the fea- 
ture vector of the object to be identified and the feature vector of the generic 
object it is compared to. The list of hypotheses is weighted by the inverse of the 
distance. 

The structural classifier presented in Section 3 is used. It compares the graph 
$G$ representing the structure of the object to the graphs $G_{gen}$ representing the 
structure of generic objects. This classifier provides a list of hypotheses weighted 
by the similarity measurement $\delta(G, G_{gen})$. 

The third classifier uses as information the label of the parent object. 

The results of the three classifiers are combined by computing a weighted 
sum of the weights associated with each hypothesis. 
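
The cooperation of the classifiers can be sketched as follows. The first function weights nearest-neighbor hypotheses by the inverse of the Euclidean distance, and the second merges several weighted hypothesis lists by a weighted sum. All names, the dictionary representation, the small `eps` guard against division by zero and the classifier weights are assumptions of this illustration; the text does not specify them.

```python
import math

def nearest_neighbor_hypotheses(x, generic_vectors, eps=1e-9):
    """Map each class label to 1 / distance(x, generic feature vector)."""
    return {label: 1.0 / (math.dist(x, g) + eps)
            for label, g in generic_vectors.items()}

def combine_hypotheses(hypothesis_lists, classifier_weights):
    """Weighted sum of per-classifier scores, sorted from best to worst.

    hypothesis_lists:   one {label: score} dict per classifier.
    classifier_weights: one weight per classifier (hypothetical values).
    """
    totals = {}
    for hyps, w in zip(hypothesis_lists, classifier_weights):
        for label, score in hyps.items():
            totals[label] = totals.get(label, 0.0) + w * score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

A label missing from one classifier's list simply contributes nothing for that classifier, so the parent-label classifier can return a short list without special handling.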

4.3 Structure Classification 

After each object has been classified, the layout and logical structures are 
updated by labeling the nodes of the graphs with the labels of the objects. Then, the 
layout and logical structures are independently classified. The graph representing 
the structure is compared to the graphs representing generic structures. The label 
attributed to the structure to be identified is the class described by the generic 
structure whose similarity measurement is the greatest, but a weighted list of 
hypotheses is also established. 
4.4 Document Classification 

Finally, once the structures have been classified, a hypothesis concerning the 
class of the document is given. This hypothesis depends on the structure classification. 
Both layout structure classification and logical structure classification 
have given a weighted list of hypotheses concerning the class of the document. 
The chosen hypothesis is obtained from the unweighted sum of the lists given by 
the structure classification. 

5 Structural Training 

The graph comparison presented in Section 3 is used in structural training. Structural 
training aims at building generic objects or generic structures. A training 
database is constituted from specific graphs from the same class. The graph re- 
presenting the generic object or the generic structure is built by searching the 
greatest subgraph. 

6 Conclusion 

This paper proposes a strategy for retrospective conversion of documents. This 
is based on the interpretation cycle, which consists in classifying each object, 
analysing the consistency of the description and resolving the inconsistencies. This 
cycle makes document analysis and document understanding dynamically in- 
teract. On one hand, the logical structure is initialised from the knowledge of 
the layout structure. On the other hand, the layout structure is not fixed and 
inconsistencies in the logical structure can lead to call into question the layout 
structure. 

The document representation describes three different contextual relations 
between objects (neighboring relations, hierarchical relations, layout-logical relations). 
These different levels of relation are exploited by the classification 
methods. 

This strategy is being implemented in a document processing system which 
should be able to process a wide range of documents and provide a convenient 
representation. Fig. 6 shows the graphical user interface of our system, which 
allows a user to verify, edit and correct the representation of a document. It is 
also used to build a database of synthetic documents. Even if the first results 
are not significant, they are encouraging. 
Abstract. This article describes a system being developed to segment 
handwritten date fields on bank cheque images. The handwritten infor- 
mation extracted from the date zone is segmented into day, month and 
year, and a hypothesis is also made on the writing style of the month 
(in word or digits). The system has been implemented and tested on 
cheque images. Subsequent modifications have also been designed and 
implemented to include contextual information in the determination of 
segmentation points. Results have shown that the system is effective; 
with continuing improvements, the system is expected to be a useful 
component for processing the date written on cheques. 



1 Introduction 

This paper describes a system being developed to segment the date information 
handwritten on bank cheques. The ability to automatically process handwritten 
dates is important in application environments where cheques cannot be cashed 
prior to the dates shown, since any delay would entail significant financial costs 
when large numbers of cheques are involved. On the other hand, developing an 
effective date processing system is very challenging due to the high degree of 
variability and uncertainty present in the dates handwritten on standard bank 
cheques, and the different recognizers required for processing the information. 
Perhaps for this reason, there has been no published work on this topic until recently, 
when work on the date fields of machine-printed cheques was reported, 
and this reference also considers date processing to be the most difficult target 
in cheque processing, given that it has the worst segmentation and recognition 
performance. 

When no a priori knowledge is available about the format or style used in representing 
the date, it is very difficult and computationally inefficient to develop 
a system to process the entire date_zone at the same time. Therefore, segmentation 
is introduced so that the problem can be reduced to the processing of 
separate components. The handwritten date_zone image is segmented into three 
subimages, and each subimage is assigned to the Day, Month or Year field. A 
decision is also made on the writing style of the Month. 
This approach is different from that of the work cited above, where the algorithm first identifies 
the bigrams ‘95’, ‘96’, ‘97’, which are common to all date patterns (e.g. Jan 25, 
1996, JANUARY 25 1996, 01/25/96, 01-25-96, etc.). It next examines the left 
neighbouring characters to determine whether they are the completion of the year 
(i.e. 1996), or a delimiter for the day or month. A tool shell has been designed 
which supports both word spotting through inexact matching, and the use of wild 
card characters to search for formatted patterns such as **/**/** and **-**-**. 

A sample of 500 valid date fields were segmented by a human operator and tested. 
The complete recognition performance with the automated field segmentation is 
estimated at 44% to 49% by using four commercial recognition devices. 

2 Cheque Databases 

During the design and development phase of an automatic cheque processing 
system, it was necessary to have access to a large quantity of bank cheques 
in order to gain insight into the various ways in which cheques can be written. 
However, such data did not exist at that time. Due to security and confidentiality 
considerations, it was also very difficult to have access to real cheques from banks 
or utility companies even for research purposes. For this reason, this research 
centre decided to create its own databases. 

For the first database (Database 1), a blank cheque was carefully designed to 
have a size and layout similar to that of regular bank cheques. Its background 
is white, and all the lines are printed in special drop-out ink which is invisible 
to the scanner. This facilitates the extraction of written information, and also 
produces extracted images of better quality for the initial development of item 
extraction and recognition processes. The cheques were filled in by university 
students, after which they were scanned and stored as binary images, ready for 
the extraction and further processing of each item of written information [3]. 
Altogether 4564 cheque images were obtained this way. 

For testing and further refinement of the cheque processing system, Database 
2 was created. This consists of over 12,000 images from real-life standard cheques, 
on which the handwritten information had been completed by university students 
and staff, as well as employees of a major utility company. Results on both 
databases are contained in this article. 

3 Writing Style Analyses 

In North America, all bank cheques have a similar layout. The date_zone is always 
positioned at the upper right corner of each cheque. Machine-printed digits ‘1’ 
and ‘9’ appear near the right end of the date_zone, thus separating the date_zone 
into two parts. The part to the right of the printed “19” is intended for entering 
the last two digits of Year. The part on the left is intended for Day and Month, 
and it is completely blank, which implies there is no pre-defined position for Day 
or Month. 
In addition, there is no restriction on the writing style. Month can be written 
in either digit or word form, while punctuation marks (period (‘.’), comma (‘,’), slash 
(‘/’) and hyphen (‘-’)) or a space can be used to identify the end of a field. Hence, 
great flexibility exists for a writer when entering the Day and Month fields, and 
dates can be represented by a large variety of writing styles, some of which are 
shown in Fig. 1. 



Fig. 1. Examples of date_zone images 



As shown in Fig. 1, the contents of the date_zone can be either pure digits when 
Month is written in digits, or a combination of digits and cursive script when 
Month is written as a word. Month can be placed either before or after Day. 
In general, the writing styles of date_zones extracted from bank cheques can be 
expressed in the following 4 patterns, with possible additions of punctuation or 
suffixes (such as ‘st’, ‘nd’, etc.): 



dd mm yy 
mm dd yy 
dd MM yy 
MM dd yy 

In the above, dd designates Day written in digits, mm designates Month written 
in digits, MM designates Month written in word form, and yy designates Year 
written in digits. 

To represent a certain date, for example “February 26, 1997”, each of the 
following 8 variations can be used: 



February(,) 26(th)(,) 1997 
Feb(.) 26(th)(,) 1997 
26(th) February 1997 
26(th) Feb(.) 1997 

(0)2/26(/) 1997 
(0)2-26(-) 1997 
26/(0)2(/) 1997 
26-(0)2(-) 1997 



In the above list, items shown within parentheses denote entities which may
or may not be present. The “19” at the start of each year stands for the printed
digits which appear on all bank cheques, representing the century. Suffixes such
as ‘st’, ‘nd’, ‘rd’, or ‘th’ can be written either as superscripts or at the same
horizontal position as the rest of the date zone information.

3.1 Statistics on Punctuations

On handwritten date images, no uniform spacing or separator is present between
Day and Month. Cheque writers often use punctuations to separate these two
fields, and the punctuations used vary. Studies on our cheque databases show
that period and comma are mostly used when Month is represented in word form,
while slash and hyphen are mostly written when Month is represented by digits.

Among 4564 such cheques of Database 1, 84.09% have Month written in
word form, and 15.91% have Month written in numerals. Among the 3837 images
where Month is in word form, punctuations appear on 765 images (about
19.94%). Detailed statistics about punctuation usage in this case are shown in
Table 1. Among the 726 date zone images where Month appears in digits, only
17 do not contain any punctuations. Table 2 illustrates the punctuation
usage when all three fields of a date zone are written in numerals.

Table 1. Use of punctuations when Month is in word form

                         No. of images   Percentage (%)
Total                         765            100.00
Period only                   537             70.20
Comma only                    154             20.13
Both period & comma            38              4.97
Others                         36              4.70


Table 2. Use of punctuations when Month is in digits

                         No. of images   Percentage (%)
Total                         709            100.00
Slash ‘/’ only                423             59.67
Hyphen only                   234             33.00
Both slash & hyphen             2              0.28
Others                         50              7.05



“Others” in Table 1 refers to those date zone images where Month is written
in word form, but the punctuations used are slash or hyphen. “Others” in Table 2
refers to cases where Month is written in digits, but the punctuations used are
period or comma.

In this database, 3089 images contain no punctuation at all; in these images,
Month is almost always written in word form (only 0.55% of such images have
Month written in numerals).

As discussed previously, if punctuations and suffixes are not considered,
date zone writing styles can be classified into 4 categories. Pertinent statistics
have also been gathered on these aspects. As shown in Table 3, when Month
is written in word form, the Month word appears before Day in about half of the
samples, and the reverse is true for the rest. However, when Month is written in
numerals, more people tend to write it after Day. In addition, on 237 date zone
images, both Month and Day are written in numerals, but the sequence cannot
be determined since both fields have values not exceeding 12.

Table 3. Statistics about date zone writing patterns

Writing style pattern    No. of images
MM dd yy                     1936
dd MM yy                     1901
dd mm yy                      442
mm dd yy                       46
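The four patterns, and the ambiguity that arises when both numeric fields are at most 12, can be illustrated on already transcribed text. The sketch below is not the image-based system described in this paper: it assumes the two fields to the left of the printed “19” are available as strings, and the function names and the returned labels "ambiguous" and "unknown" are hypothetical.

```python
# Illustrative only: classifies two transcribed fields into one of the four
# writing patterns. The paper works on images; this operates on text.
MONTH_PREFIXES = {"jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"}


def is_month_word(token):
    """True if the token looks like a month name or abbreviation."""
    word = token.strip(".,").lower()
    return word.isalpha() and word[:3] in MONTH_PREFIXES


def leading_number(token):
    """Return the leading digits of a token as int (ignores 'st', 'th', ...)."""
    digits = ""
    for ch in token:
        if ch.isdigit():
            digits += ch
        else:
            break
    return int(digits) if digits else None


def classify_pattern(first, second):
    """Map the two fields left of the printed '19' to a writing pattern."""
    if is_month_word(first) and leading_number(second) is not None:
        return "MM dd yy"
    if is_month_word(second) and leading_number(first) is not None:
        return "dd MM yy"
    a, b = leading_number(first), leading_number(second)
    if a is None or b is None:
        return "unknown"
    if a > 12 and b <= 12:
        return "dd mm yy"
    if b > 12 and a <= 12:
        return "mm dd yy"
    if a <= 12 and b <= 12:
        return "ambiguous"  # e.g. 5/7: the order cannot be determined
    return "unknown"
```

For instance, `classify_pattern("5", "7")` falls into the ambiguous group, which corresponds to the 237 images mentioned above.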



4 Feature Description 

The input of the automatic date processing system consists of binary date zone
images, each of which is decomposed into a set of connected components. These
are analyzed according to a set of features designed to detect the four types of
punctuations as well as the printed digits ‘1’ and ‘9’.

In general, two categories of features, shape features and spatial features,
can be considered. Shape features deal with the geometric aspects of each
connected component, particularly its appearance and measurements. Spatial
features deal with the contextual aspects of each connected component, which
provide important information especially when the objective is to process a text
line. They are used to describe the location of each connected component with
respect to the entire date zone image as well as its neighbouring components.

The shape features used to describe punctuations are high_density, narrow,
flat, slope, small, simple_curve and no_innerloop, while the spatial features used
consist of exceed_neighbour, at_middlezone, mid_to_neighbour, below_lowerhalf and
low_to_left. The selection of features for detecting each punctuation was based
on experimentation with all the developed shape and spatial features. The
following four sets of features were chosen to characterize the punctuations:

— Slash: narrow and exceed_neighbour;

— Hyphen: high_density, flat, at_middlezone and mid_to_neighbour;

— Period: high_density, small and below_lowerhalf;

— Comma: narrow, small and below_lowerhalf.
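The four feature sets can be read as a small lookup table. The following sketch assumes that each feature has already been evaluated to true or false for a connected component (how the features are measured is not specified here), and the function name is hypothetical. A component matching more than one set is returned with all its candidates, since this section does not state how such conflicts are resolved.

```python
# Feature sets taken from the list above; the feature detectors themselves
# are assumed to exist elsewhere and are not part of this sketch.
PUNCTUATION_RULES = {
    "slash": {"narrow", "exceed_neighbour"},
    "hyphen": {"high_density", "flat", "at_middlezone", "mid_to_neighbour"},
    "period": {"high_density", "small", "below_lowerhalf"},
    "comma": {"narrow", "small", "below_lowerhalf"},
}


def punctuation_candidates(active_features):
    """Return every punctuation whose required features all fired."""
    active = set(active_features)
    return [name for name, required in PUNCTUATION_RULES.items()
            if required <= active]
```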



5 Segmentation Strategies 

The purpose of date zone image segmentation is to divide the entire image into
three subimages representing Day, Month and Year respectively, and also to
generate a hypothesis on how Month is written so that it can be processed using
the appropriate recognizer.

Since machine-printed digits ‘1’ and ‘9’ are present on all standard bank
cheques, it is more reliable to start segmentation by searching for the
printed “19”, because this is a more stable part of the image than the rest of the
date zone, which contains completely unconstrained handwritten information.
To detect these printed characters, the connected components of the date zone
image are examined in sequence from the right until a printed ‘1’ is found or
the left end of the date image is reached. The ‘1’ is identified by the slope,
narrow and high_density features since the stroke is mostly vertical, relatively
narrow and occupies most of its bounding box. Once this is located, its right
neighbour is examined to see if it is the printed digit ‘9’. If this component
contains exactly one inner loop, and it has approximately the same height as its
left neighbour, these two adjacent connected components are considered to be
the printed digits “19”. These components are then removed, resulting in two
separate subimages, Day&Month and Year. The subimage on the left is assumed
to be the Day&Month subimage, and it will be further segmented into Day and
Month subimages.
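The search for the printed “19” can be sketched as a right-to-left scan over the connected components. The `Component` record, the `height_tolerance` value and the decision to keep scanning when a ‘1’-like component has no matching ‘9’ are assumptions made for illustration; the text above only fixes the features used and the scan direction.

```python
from dataclasses import dataclass


@dataclass
class Component:
    features: frozenset  # names of the shape features that fired
    inner_loops: int     # number of inner loops in the component
    height: int          # bounding-box height in pixels


def find_printed_19(components, height_tolerance=0.2):
    """Scan left-to-right ordered components from the right.

    Returns the index of the printed '1' (its right neighbour is the '9'),
    or None if no such pair is found. height_tolerance is a hypothetical
    relative bound for "approximately the same height".
    """
    one_features = {"slope", "narrow", "high_density"}
    for i in range(len(components) - 2, -1, -1):
        one, nine = components[i], components[i + 1]
        if not one_features <= one.features:
            continue
        same_height = abs(nine.height - one.height) <= height_tolerance * one.height
        if nine.inner_loops == 1 and same_height:
            return i
    return None
```

With the returned index `i`, `components[:i]` would form the Day&Month subimage and `components[i + 2:]` the Year subimage.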



5.1 Observations on Day and Month Segmentation 

As observed previously, four types of punctuations are frequently seen in date
zone images. The position of a punctuation can mark the end of the Day or Month
field, or a division between Day and Month. Based on the statistics presented in
Tables 1 and 2, the type of punctuation written after Month can suggest how this
field is written. The presence of slash or hyphen strongly implies that Month is
written in digits, while the presence of period or comma implies a word.
Therefore, once the Day&Month subimage is scanned for these four types of
punctuations, a segmentation between Day and Month is suggested.

It has been observed that a large proportion of date zone images (3089 of the
4564 images of Database 1) do not contain any punctuation; in these cases, Month
is almost always written as a word, and most people tend to leave a gap between
Day and Month. Locating an interword gap can then provide a suitable
segmentation point between Day and Month.

For the detection of an interword gap, many algorithms can be used to compute
the distances between pairs of connected components [5,6]. In this work, we
consider the maximum gap to occur where the maximum distance between
neighbouring components occurs on the largest number of scan lines. This method
is completely independent of threshold values, and it is effective and
computationally efficient. Statistics show that among those cheques where no
punctuation is written between Day and Month, or only one punctuation is written
at the end of the Day&Month subimage, 80.11% of the maximum gaps are correctly
detected by this method.
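One possible reading of this scan-line criterion is a voting scheme: on every scan line, the pair of neighbouring components with the widest horizontal distance receives one vote, and the pair with the most votes marks the maximum gap. The sketch below follows that reading. The label-grid input (0 for background, component indices numbered from left to right) and the function name are assumptions, not details given in the paper.

```python
from collections import Counter


def widest_gap_by_votes(labels):
    """labels: 2-D list of ints, 0 = background, k > 0 = component index
    (components numbered left to right). Returns the pair (a, b) of
    neighbouring components whose gap is the widest on most scan lines,
    or None if no scan line contains a gap."""
    votes = Counter()
    for row in labels:
        # Leftmost and rightmost pixel of each component on this scan line.
        extents = {}
        for x, lab in enumerate(row):
            if lab:
                lo, hi = extents.get(lab, (x, x))
                extents[lab] = (min(lo, x), max(hi, x))
        present = sorted(extents)
        best_pair, best_gap = None, 0
        for a, b in zip(present, present[1:]):
            gap = extents[b][0] - extents[a][1] - 1
            if gap > best_gap:
                best_pair, best_gap = (a, b), gap
        if best_pair is not None:
            votes[best_pair] += 1
    return votes.most_common(1)[0][0] if votes else None
```

No distance threshold appears anywhere in the procedure, which matches the threshold independence claimed above.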

After punctuations have been detected on the Day&Month subimage, a set of
empirical rules is applied to obtain the segmentation result and the hypotheses.
The segmentation process separates the Day&Month subimage into two parts,
while hypotheses are made on which part represents Day and which represents
Month, and whether the Month field is written in word or digits. The rules used
at this step are based on:

— the number of punctuations detected;

— the locations of the punctuations; and

— the types of punctuations detected.

6 Results of Segmentation 

Table 4 shows the performance of date zone image segmentation on the images
of Database 1, where “correct” means that the cutting positions are properly
located and hypotheses are correctly generated, so that each field can be sent to
the appropriate recognizer (digit or cursive word).

Table 4. Performance of date zone segmentation on Database 1

Total no.    Correct (%)    Reject (%)    Error (%)
  4564          74.96          7.82         17.22



In Table 4, the rejection of a date zone image can be due to one of the
following reasons:

(a) Improper binarization causes too many (broken) components to be created
in the image. Such images are simply rejected and excluded from further
processing.

(b) Some date zone images contain only one connected component representing
the entire date zone. Since this provides no clues for segmentation, these
images are rejected.

(c) Printed digits “19” cannot always be detected because they are almost
completely overwritten by other fields, or touch the rest of the image.
Sometimes ‘1’ and ‘9’ are distorted so that their features cannot be detected
properly. As locating “19” is the first step in date zone segmentation, these
images are rejected.

(d) The Day&Month subimage contains only one connected component, which
provides no clues for segmentation.

Punctuation detection is also a critical step in Day&Month subimage
segmentation as described above, and therefore correct detection of punctuations
is significant for the performance of this process. Table 5 gives the performance
of each punctuation detection algorithm on Database 1. It shows that the methods
proposed in this work are effective for this purpose.

However, errors do occur during this step, in that components of other fields
are misinterpreted as punctuations. For example, broken strokes (most probably
due to binarization) may be misinterpreted as punctuations, and the digit ‘1’
or letter ‘l’ can be identified as slash ‘/’. For a more accurate determination of
punctuations and segmentation points, a further strategy has been developed
and is described below.

Table 5. Performance of punctuation detection on Database 1

             Slash (%)    Hyphen (%)    Period (%)    Comma (%)
Correct        94.72        94.02         92.80         93.63



7 Segmentation with Confirmation 

Given that date zone image segmentation depends heavily on the accurate
determination of a separator between the Day and Month fields, it is important
to improve the accuracy in the detection of this separator, which may be a
punctuation or an interword gap. This separator can be determined from:

(a) Presence of a significant gap or punctuation within the Day&Month subimage,
or

(b) A transition between digits and letters.

Consequently, we develop and add a confirmation procedure to our segmentation
strategy, so that a two-level strategy is implemented. Using this strategy,
more candidates for punctuation and gap are determined than before. However,
these candidates are considered to be separators only if they satisfy more
stringent conditions than used previously. Otherwise the confirmation procedure
is applied at the second level by considering the contextual information or the
nature of the subimage on either side of the candidate.

For example, locating the interword gap in the Day&Month subimage can
be a difficult task because this gap may not always appear as the widest gap
observed in the Day&Month subimage when users write the date freely. However,
if it can be determined that a gap occurs at the transition between numeric and
alphabetic fields, then this gap can be considered to be gap_DM, the gap between
the Day and Month fields. Similarly, when a candidate for slash is considered,
this candidate can be confirmed as slash when subimages on both sides of the
candidate show high likelihoods of being numeric, and not confirmed as slash
when both subimages are highly unlikely to be numeric. (In the former case, we
apply our experimental knowledge that slashes are often used when both Day
and Month are represented numerically, and also the a priori condition that each
such field should contain at most two digits, so that a symbol appearing between
two numeric images should be considered a separator rather than digit ‘1’.)
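The confirmation of a slash candidate can be summarized as a three-way decision. In the sketch below the numeric likelihoods of the two neighbouring subimages are taken as given values in [0, 1]; the thresholds `high` and `low` and the function name are hypothetical, and the text does not state what happens in the undecided case, which is returned here as `None`.

```python
def confirm_slash(left_numeric, right_numeric, high=0.8, low=0.2):
    """Second-level check for a slash candidate.

    left_numeric / right_numeric: likelihood (0..1) that the subimage on
    each side of the candidate is numeric. The thresholds are illustrative.
    Returns True (confirmed), False (not a slash) or None (undecided).
    """
    if left_numeric >= high and right_numeric >= high:
        return True   # separator between two numeric fields
    if left_numeric <= low and right_numeric <= low:
        return False  # both sides look alphabetic
    return None       # context alone does not settle the question
```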

For our purpose, the likelihood of a subimage being numeric is determined
by a combination of the following information:

(a) the confidence value and the number of digits in the subimage returned by
a connected digit recognizer [7], and

(b) structural features of the subimage. These consist of the maximum number
of horizontal runs, the sum of the numbers of peaks and valleys, and the
number of inner loops in the subimage. These numbers should be below
certain thresholds for numeric subimages, which are usually simpler than
alphabetic ones. The thresholds are determined through experimentation
with the training set.

Based on the above information, a measure can be obtained that represents
the likelihood of a subimage being numeric, and this measure is used in the
confirmation process described above. The effectiveness of this measure,
Confid_numeric, in differentiating between alphabetic and numeric images can be
seen from results obtained on 4205 samples of month words and 4000 numeric
samples from Database 2. These are shown in Fig. 2, and it can be seen that there
is a strong correlation between very high (low) values of Confid_numeric and
numeric (alphabetic or word) samples. Of course, the accurate determination of
Confid_numeric is also important for the detection of the style of Month, so that
an appropriate recognizer can be selected for subsequent processing.




Fig. 2. Relationship of Confid_numeric to numeric and word images



Using this measure, the two-level strategy is implemented and tested on 809
cheques of Database 2, and the results are given in Table 6 for both the previous
and two-level methods.



Table 6. Performance of two methods for date zone segmentation (809 images)

                      Correct (%)    Reject (%)    Error (%)
Two-level strategy       83.19          4.82         11.99
Previous method          70.83          5.07         24.10



From Table 6 it is observed that the new approach has achieved a higher
correct segmentation rate and lower error rate than the previous method under
the same testing conditions.

8 Concluding Remarks

This paper proposes a method of automatically segmenting the date information
handwritten on bank cheques, together with an improvement of this method. The
improvement depends on using contextual information provided by a connected
digit recognizer, and it has been found to be effective. We will continue to
improve our work in this area through the incorporation of contextual
information and results from the recognizer(s).



Abstract. In recent years numerous approaches to automatic reasoning
about mechanical assemblies have been published. CAD techniques,
graph-based approaches and semantic methods to model assembled objects
were proposed. However, all these methods are difficult to employ
for visually supervising assembly processes since they lack the flexibility,
generality or ease required in practice. This paper presents the idea of
using syntactic methods known from discourse theory in order to model
classes of mechanical assemblies. Moreover, a criterion to derive recursive
grammatical production rules is introduced so that the representation of
a class of assemblies becomes especially compact. This general
representation scheme allows hierarchical structural descriptions of
individual assemblies to be derived automatically by means of computer vision.



1 Introduction 

The work presented in this paper is embedded in a research project studying
advanced human-machine communication. The project’s purpose is to develop
a robot which is able to process visual and acoustic data in order to recognize
and manipulate arbitrarily positioned objects in its environment according to
the instructions of a human. The domain was chosen to be the cooperative
construction of a toy airplane using parts of a wooden construction kit for
children (see Fig. 1). As is customary for every mechanical construction process,
this scenario implies that objects from a set of separate parts are assembled to
form more complex units. These units are called mechanical assemblies and are
defined to be sets of solid parts with a distinct geometric relation to one
another. Subassemblies are subsets of parts consisting of one or more elements in
which all parts are connected. Thus, in a construction process subassemblies are
sequentially assembled and the resulting products typically show a hierarchical
structure.

Supervising robotic assembly by means of a computer vision system requires
a representation of knowledge about feasible assemblies in order to recognize
complex objects in image data. Concerning flexibility, our system must not be
specialized to the construction of airplanes. Thus, a compact representation is
required which makes it possible to describe any of the numerous feasible toy
assemblies.

Throughout the last decade there has been extensive research concerning flexible
assembly. In geometry-based modeling, CAD techniques are used to describe
elementary objects and assemblies. Graph-based methods model certain
relations between elementary parts [5], whereas semantic approaches make use of
manually created hierarchical descriptions. There are also sensor-based
approaches to monitor assembly processes which, however, employ highly
specialized sensing tools like laser range finders [7].




(a) (b) 

Fig. 1. Examples of feasible assemblies in the construction scenario. 

All these approaches either depend on detailed geometric information or
require structural knowledge about individual assemblies. Our system, however,
should react to instructions in reasonable time, but visually determining the
geometric structure of an assembly is rather time consuming. Moreover, it is
impossible to provide a detailed structural description for every imaginable
assembly in our scenario. To cope with these difficulties we propose a
grammar-based method to model assembled objects, which is described in the
following.

2 Assembly Modeling by Discourse Grammars 

2.1 Motivation 

Syntactic approaches to pattern recognition problems have been used for many
years. They allow complex patterns of interrelated primitives to be classified
and yield a structural description simultaneously. Furthermore, the possibility
to define recursive production rules generally leads to particularly compact
representations of classes of patterns.

Grammatical methods have also been introduced to computer-aided manufacturing
applications like workpiece or machine tool design. However, to the
best of our knowledge grammar-based approaches have not yet been used to
qualify the internal structure of mechanical assemblies.

Our efforts towards this end start from the following consideration: putting
mechanical parts together means connecting them via some mechanical features.
These mating features are well known in assembly modeling, where their
geometrical aspects are of primary concern. An examination of their functional
aspects reveals, however, that mating features also allow recursive production
rules to be specified: from a mating feature point of view an assembly consists
of several functional units, each providing features necessary to realize stable
connections. Similarly, an assembly is composed of several subassemblies which,
as pointed out in the introduction, may be elementary parts or assembled objects.
Thus, from a functional viewpoint elementary parts and assemblies can be treated
equally: both must provide mating features to be assembled.

This observation immediately leads to a recursive method to describe
mechanical assemblies. Consider for example the bolt-nut type assemblies depicted
in Fig. 1. Objects like rings, bars or sockets (called miscellaneous objects in the
following) can be put onto bolts and are fastened using nut-type objects like
cubes or rhomb-nuts. As Fig. 1(b) indicates, an assembly consists of at least a
bolt and a nut and optionally of some miscellaneous objects, and may itself serve
as a bolt, a miscellaneous object or a nut. Thus, understanding assemblies to
be composed of a bolt-part, an optional miscellaneous-part and a nut-part results
in the following grammar for bolted assemblies:



ASSEMBLY   →  BOLT_PART NUT_PART |
              BOLT_PART MISC_PART NUT_PART                              (1)

BOLT_PART  →  ASSEMBLY | BOLT                                           (2)

NUT_PART   →  ASSEMBLY | CUBE | RHOMBNUT                                (3)

MISC_PART  →  ASSEMBLY | BAR | FELLY | RING | SOCKET |
              ASSEMBLY MISC_PART | BAR MISC_PART | RING MISC_PART |
              FELLY MISC_PART | SOCKET MISC_PART                        (4)



Production rule (1) describes the composition of a bolted assembly and reflects
the fact that the miscellaneous-part is optional. The second and third rules state
that the bolt-part and the nut-part consist either of a (sub)assembly or of a
corresponding elementary object. Rule (4) describes how the miscellaneous-part of
an assembly is constructed: it is either given by a single part or a subassembly
or a possibly infinite sequence of those. In reality, of course, the number of
miscellaneous objects depends on the length of the corresponding bolt. This fact
is neglected here but is captured by a unification process (see section 2.3).
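Rules (1) to (4) can be checked on nested descriptions with a few lines of code. The following sketch is a plain recursive membership test, not the LR(1) parser with unification used in this work. It assumes, for illustration only, that an assembly is written as a Python list whose first element is the bolt-part, whose last element is the nut-part and whose middle elements form the miscellaneous-part, with nested lists standing for subassemblies.

```python
# Terminal symbols of the grammar for bolted assemblies.
BOLTS = {"BOLT"}
NUTS = {"CUBE", "RHOMBNUT"}
MISC = {"BAR", "FELLY", "RING", "SOCKET"}


def is_part(x, elementary):
    """A part is an elementary object of the right class or a subassembly."""
    return x in elementary if isinstance(x, str) else is_assembly(x)


def is_assembly(x):
    """True if the nested list x can be derived from ASSEMBLY.

    Rule (1): bolt-part, optional miscellaneous-part, nut-part.
    Rules (2)-(4): each part is an elementary object or an ASSEMBLY;
    the miscellaneous-part is a sequence of one or more such items.
    """
    if not isinstance(x, list) or len(x) < 2:
        return False
    bolt_part, *misc_part, nut_part = x
    return (is_part(bolt_part, BOLTS)
            and all(is_part(m, MISC) for m in misc_part)
            and is_part(nut_part, NUTS))
```

For example, `[["BOLT", "CUBE"], "BAR", "RHOMBNUT"]` is accepted because a subassembly may take the role of the bolt-part. The length of the bolt is ignored here, just as in the grammar itself.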

In fact, this grammar produces linear structures and is rather simple. Even
though sophisticated techniques like graph grammars seem more appropriate, we
will show in the following that this simple grammar is nevertheless sufficient
to describe 3D objects resulting from a construction process. Moreover,
grammatical assembly models are not restricted to cases with one class of mating
features or to binary mating relations. Figure 2 shows another (rather didactic)
assembly domain to illustrate how the principles deduced for bolted assemblies
can be generalized to n-ary mating relations and several types of connections. A
hammer as depicted in Fig. 2(c) consists of subassemblies which are connected
via different mating features. To construct the handle, a ternary mating relation
comprising a chock-part, a ring-part and a shaft-part has to be established, while
the head of the tool is connected to the handle via a snap fit mechanism.



2.2 Discourse Grammars - Assembly Grammars 

As mentioned above, this work is associated with research in man-machine
communication where speech processing is of particular importance. It was this
fortunate circumstance that revealed a similarity between discourse and assembly
structures. In this section some commonly known facts of discourse analysis
are outlined to motivate that the underlying principles for processing discourse
can be directly transformed to a computational framework for reasoning about
assemblies. Intuitively, discourse grammars specify how structural units are
combined according to tools for investigating discourse structure and discourse
relations. It is generally agreed that discourse has a recursive structure, i.e. a
discourse structure is recursively built based on discourse relations which can be
stated between smaller discourse segments. Therefore, a discourse grammar
consists of a sentence grammar, mostly chosen to be unification based, and a
set of (binary) discourse grammar rules, i.e. a set of rules describing discourse
relations. This defines a framework in which both intra- and intersentential
constraints can be expressed, i.e. a framework which integrates discourse
constraints together with syntactic and semantic constraints. Moreover, often the
processed discourse structure is only partially available for further processing.
This means that only subparts with special properties are labeled as admissible
parts for subsequent interpretation. Summarizing, discourse grammars generate
a structure based on semantic and syntactic knowledge which captures various
discourse phenomena.

Grammar:

ASSEMBLY      →  SHAFT_PART RING_PART CHOCK_PART |
                 SNAP_PART SNAPPED_PART
SHAFT_PART    →  SHAFT
CHOCK_PART    →  CHOCK
RING_PART     →  ASSEMBLY | TIGHTENINGRING
SNAP_PART     →  PICKHEAD | SLEDGEHEAD
SNAPPED_PART  →  ASSEMBLY | TIGHTENINGRING

Derivation tree of a hammer:

ASSEMBLY
  SNAP_PART: SLEDGEHEAD
  SNAPPED_PART: ASSEMBLY
    CHOCK_PART: CHOCK
    RING_PART: TIGHTENINGRING
    SHAFT_PART: SHAFT

Part labels: 1: CHOCK, 2: TIGHTENINGRING, 3: SHAFT, 4: SLEDGEHEAD, 5: PICKHEAD

Fig. 2. Another assembly domain.

To show that discourse theory can be transformed into a framework for the
description of the internal structure of assemblies, the structural properties of
assemblies must meet the requirements of discourse. That is, assemblies must have
a recursive structure in that bigger units can be obtained by recursive sequencing
and embedding of smaller ones. Additionally, this recursion must be defined by
a relation which holds between subassemblies. As outlined above, recursion
is naturally given for mechanical assemblies due to the properties of the
construction process. Furthermore, a relation can be stated between subassemblies
which is given by common elementary objects: subassemblies can be conjoined
to yield larger units iff these subassemblies share common elementary objects.

2.3 The Processing Model

The approach for recognizing assembled objects is based on an LR(1) parser and a
unification grammar. The advantage of this approach is that the unification
grammar puts most of the syntactic information that is standardly captured in
context-free phrase structure rules into the lexicon. Every word is represented
in the lexicon by a feature structure which specifies the values for various
attributes. In the assembly approach the lexicon contains all elementary objects,
and the feature structure for each object specifies values for various attributes
describing the structural properties of this object. An LR(1) parser standardly
uses two actions: shift and reduce. The reduce action reduces the right-hand
symbols of a grammar rule to the left-hand symbol. Our model is based on an LR(1)
parser which is augmented with the facility to handle feature structures.
This implies that every reduce action includes a unification of the corresponding
feature structures, yielding structural descriptions of subassemblies. The shift
action pushes the input symbols from the input string onto the processing stack.
In order to ‘parse’ three-dimensional objects, an order in which to shift
recognized objects of a given scene onto the processing stack must be defined,
i.e. a multidimensional signal must be transformed into a linear sequence of
primitives. Generally, this order depends on the domain, but as a rule of thumb
it is always defined with respect to necessary elementary objects. These are
objects that must be found in every subassembly (e.g. bolts in bolted assemblies)
and constrain the range within the signal where parsing has to proceed next (for
a detailed description of our linearization heuristic see section 3).

The basic idea of our processing model is to try to find small units in the input
signal (similar to discourse segments in speech parsing) and to conjoin them
with units found earlier, if possible. This means that, based on the shift and
reduce actions of the LR(1) parser, the process of parsing subparts of the signal
yields derivation trees (or structural descriptions) of subassemblies which are
conjoined with already derived descriptions, if necessary. Subsignal parsing
terminates if the signal is processed completely or if the unification fails. A
unification error occurs if two objects should be assembled according to the
LR(1) parser grammar but the subcategorization frame of the corresponding feature
structure of one of these objects is already satisfied (e.g. a bolt and a cube
will not be combined if the cube cannot absorb another bolt since all its holes
are used already). If parsing a subpart of the signal terminates, two different
cases can occur:

1. No other subassembly has been recognized so far. Therefore, the parsing process
restarts at another part of the signal, i.e. with another necessary elementary
object, if one exists; otherwise the complete assembly is recognized.

2. Other subassemblies have already been recognized. Therefore, it is tested
whether the subassemblies recognized earlier can be conjoined with the recent one
to form a larger assembly. After that, the parsing process is continued at another
necessary elementary object, if one exists.
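The unification error mentioned above can be pictured with a deliberately reduced feature structure. In the sketch below a nut-type object carries a single attribute, the number of free holes, and unification with a bolt fails once that number reaches zero. The class names, the attribute `free_holes` and the function name are hypothetical; the actual feature structures of the system are not specified in this section.

```python
from dataclasses import dataclass


class UnificationError(Exception):
    """Raised when a subcategorization frame is already satisfied."""


@dataclass(frozen=True)
class NutFeatures:
    name: str
    free_holes: int  # hypothetical attribute: holes that can still take a bolt


def unify_bolt_with_nut(bolt_name, nut):
    """Return the nut's updated feature structure after absorbing a bolt."""
    if nut.free_holes <= 0:
        raise UnificationError(f"{nut.name} cannot absorb {bolt_name}")
    return NutFeatures(nut.name, nut.free_holes - 1)
```

In the processing model, such a failure terminates parsing of the current subpart of the signal, after which one of the two cases listed above applies.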

In order to merge subassemblies, a conjoin relation is stated between derivation
trees (also denoted assembly trees) to integrate already derived trees (which
are called current assembly trees according to discourse theory terminology) into
incoming assembly trees (which is another denotation for the most recently derived
assembly tree).

Definition: The conjoin relation △ combines two assembly trees X and Y
if both share an elementary object. The assembly tree X is integrated into Y at
node n which is labeled with the common elementary object and which occurs
at the frontier of the incoming assembly tree Y. This means that in Y △ X the node
n of Y is substituted by the assembly tree X (see Fig. 3(c) and 3(d)).
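The definition can be made concrete with a small tree substitution. The sketch below is an illustration under simplifying assumptions: trees are built from a hypothetical `Node` class, elementary objects are the leaves, and every frontier node of the incoming tree that carries a shared label is replaced by the current tree. It does not model the head-node search or the parser integration described in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def leaves(tree):
    """Labels of the elementary objects at the frontier of a tree."""
    if not tree.children:
        return [tree.label]
    return [lab for child in tree.children for lab in leaves(child)]


def conjoin(incoming, current):
    """Return incoming △ current, or None if no elementary object is shared."""
    shared = set(leaves(incoming)) & set(leaves(current))
    if not shared:
        return None

    def substitute(node):
        if not node.children:
            # Frontier node: replace it if it is a common elementary object.
            return current if node.label in shared else node
        return Node(node.label, [substitute(c) for c in node.children])

    return substitute(incoming)
```

With the trees B (BOLT2, CUBE2) and C (BOLT3, BAR, CUBE2), `conjoin(C, B)` replaces the CUBE2 leaf of C by the whole tree B.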

In our processing model this relation is realized by an additional LR(1) parser
action, called conjoin. If an assembly tree is derived, the conjoin action first
searches for common elementary objects in the current assembly trees and then,
if common elementary objects are found, conjoins both assembly trees. Since
the head node of every subassembly tree is labeled with the elementary objects
contained in the highest hierarchical level of the subassembly, only nodes along
the path given by the head node have to be searched for common objects. This
conjoin action is performed on every elementary object of the incoming assembly
tree and for all current assembly trees^1. Three different cases can occur:



1. No common elementary object is found in the current assembly trees, i.e.
only subassemblies that are not connected to the incoming one have been
recognized so far (see Fig. 3(a) and 3(b)).

2. The conjoin relation can be stated between a current assembly tree and
the incoming assembly tree. Therefore, the current assembly tree will be
integrated into the incoming one (see Fig. 3(d)).

3. More than one common elementary object can be found in a current assembly
tree. Consequently, every node of the incoming assembly tree that is labeled
with a common elementary object will be substituted by the current assembly tree.
Because the same current assembly tree is integrated several times into the
incoming one, all occurrences of the current assembly tree in the incoming
one are identical.² This corresponds to cases where assemblies are connected
via several elementary objects. Therefore, all occurrences of a current assembly
tree in an incoming assembly tree are unified to one singleton subtree.
An example of this case is shown in Fig. 5(g).
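The three cases above can be illustrated with a minimal Python sketch. It is not the authors' LR(1) parser action: the `Node` class and the `conjoin` function are hypothetical names, trees are plain labeled nodes, and common objects are found by comparing complete leaf sets rather than by searching only along the head-node path.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Node of an assembly tree; leaves carry elementary objects such as 'CUBE2'."""
    label: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the set of elementary objects at the frontier of this tree."""
        if not self.children:
            return {self.label}
        return set().union(*(child.leaves() for child in self.children))


def conjoin(incoming, current):
    """Integrate `current` into `incoming` at every frontier node they share.

    Returns the conjoined tree, or None if no elementary object is shared (case 1).
    """
    common = incoming.leaves() & current.leaves()
    if not common:
        return None

    def substitute(node):
        if not node.children:
            # Cases 2 and 3: every shared leaf is replaced by the same subtree
            # object, so several occurrences are unified by construction.
            return current if node.label in common else node
        return Node(node.label, [substitute(child) for child in node.children])

    return substitute(incoming)


# Subassemblies B and C of Fig. 3 share the object CUBE2.
B = Node("ASSEMBLY", [Node("BOLT_PART", [Node("BOLT2")]),
                      Node("NUT_PART", [Node("CUBE2")])])
C = Node("ASSEMBLY", [Node("BOLT_PART", [Node("BOLT3")]),
                      Node("MISC_PART", [Node("BAR")]),
                      Node("NUT_PART", [Node("CUBE2")])])
conjoined = conjoin(C, B)  # the NUT_PART of C now holds the tree of B
```

Reusing one subtree object for all substituted leaves is a design shortcut that mirrors the unification described in case 3.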



¹ After a conjoin action is performed, the comparison process continues with the conjoined assembly
tree, which from then on is the new incoming tree, and the remaining current trees, including the just
conjoined current tree in case it contains several common elementary objects.

² This can easily be verified from the head nodes of the integrated current assembly trees, because
they are all labeled with the same elementary objects.

3 Recognition

In this section we illustrate the processing model for the visual recognition of
assembled objects by means of the example depicted in Fig. 5(a) (note that the
object labels were added manually to facilitate the discussion).

As a syntactic method the assembly recognition procedure depends on primitives,
which in our case are elementary objects. Following a hybrid approach
that combines neural and semantic nets, Kummert et al. [?] realized a fast
and reliable method to recognize such primitives, yielding labeled image regions
as shown in Fig. 5(b). Since assembled objects are physically connected, they
appear as clusters of adjacent regions once the recognition of single objects is
completed. In order to recognize assemblies by parsing clusters of regions, their
elements have to be examined in a sequential manner. Generally, a method to
order two-dimensionally distributed primitives is domain dependent and has to
be found heuristically. In our application, simple topological considerations lead
to a straightforward linearization technique: objects that are put onto bolts are
linearly arranged in space, thus the centers of mass of regions representing single
objects are arranged approximately along a line (see the crosses in Fig. 5(b)).
Therefore, starting parsing at regions representing bolts and then choosing an adjacent
region provides a search direction. If there are several adjacent regions, different
alternatives for parsing must of course be considered. After examining a certain
region in the parsing process, the next object to be examined must be adjacent
to the current one, and if there are several candidates the nearest one complying
with the search direction is chosen. If an adjacent object depicts a bar, this heuristic
has to be modified, since the positions of the holes have to be considered
instead of the center of mass. Obviously, holes that have been used in construction
are not visible; however, virtual positions as emphasized in Fig. 5(b) can be
estimated by dividing the bar's main axis into equidistant ranges.

[Figure 3: assembly trees.
(a) A: ASSEMBLY(B1, C1) with BOLT_PART and NUT_PART.
(b) B: ASSEMBLY(B2, C2) with BOLT_PART (BOLT2) and NUT_PART (CUBE2).
(c) C: ASSEMBLY(B3, BAR, C2) with BOLT_PART (BOLT3), MISC_PART (BAR) and NUT_PART (CUBE2).
(d) C △ B: ASSEMBLY(B3, BAR, C2) with BOLT_PART (BOLT3), MISC_PART (BAR) and a NUT_PART
that holds ASSEMBLY(B2, C2) with BOLT_PART (BOLT2) and NUT_PART (CUBE2).]

Fig. 3. (a), (b) and (c) Assembly trees of singleton subassemblies. (d) Subassemblies
C and B share object CUBE2, thus the corresponding trees are conjoined. All
structures describe subassemblies of the assembly shown in Fig. 5(a).
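The estimate of virtual hole positions along a bar's main axis can be sketched as follows. This is only an illustration under stated assumptions: the function name `virtual_hole_positions` is hypothetical, the axis is given by its two end points in image coordinates, and each hole is assumed to sit at the midpoint of its range (the text only states that the axis is divided into equidistant ranges).

```python
def virtual_hole_positions(axis_start, axis_end, n_holes):
    """Split the main axis into n_holes equal ranges and return each range's midpoint."""
    (x0, y0), (x1, y1) = axis_start, axis_end
    positions = []
    for i in range(n_holes):
        t = (i + 0.5) / n_holes  # relative position of the i-th range's midpoint
        positions.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return positions


# A horizontal bar of length 8 with four holes (assumed example values).
holes = virtual_hole_positions((0.0, 0.0), (8.0, 0.0), 4)
```

The resulting positions can then take the place of the center of mass when the nearest adjacent object along the search direction is selected.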

Figures 5(c) to 5(f) show the intermediate results and the final outcome of the
assembly recognition procedure for the example. After starting the process at
BOLT1, the next object to be examined according to the heuristic is CUBE1.
According to the rules of the grammar both objects form an assembly (see the
corresponding assembly tree of subassembly A in Fig. 3(a)); thus a white polygon
comprising both labeled regions is output (Fig. 5(c)). Then another necessary
elementary object, i.e. a bolt (here: BOLT2), is chosen as the starting point for
a parsing process leading to the description of subassembly B in Fig. 3(b) and
the recognition result depicted in Fig. 5(d).

Continuing at BOLT3 yields assembly tree C in Fig. 3(c), which shares an
elementary object with subassembly B. Consequently, both descriptions are
unified, leading to the description C △ B in Fig. 3(d) and the result depicted in
Fig. 5(e). Note that black polygons surround the objects of recently found complex
subassemblies. Finally, continuing the parsing process at BOLT4 detects
subassembly D shown in Fig. 4(a). Since D shares elementary objects with
subassemblies A and C △ B, the corresponding conjoin operation leads to the assembly
tree (D △ (C △ B)) △ A describing the whole assembly, and a polygon comprising
all the subassemblies is generated (see Fig. 5(f)).

[Figure 4: assembly trees.
(a) D: ASSEMBLY(B4, BAR, C1) with BOLT_PART (BOLT4), MISC_PART (BAR) and NUT_PART (CUBE1).
(b) (D △ (C △ B)) △ A: assembly tree of the whole assembly, rooted at ASSEMBLY(B4, BAR, C1)
and containing the subtrees of C △ B and A.]

Fig. 4. Assembly trees describing a singleton subassembly and the whole assembly
depicted in Fig. 5(a).

4 Conclusion 

Structurally analyzing mechanical assemblies typically yields a component hierarchy
where higher-level components consist of lower-level parts which again
may have internal structure. This paper presented the idea of using grammars to
describe the hierarchical structure of individual assemblies. Analyzing
mechanical mating features revealed a functional equivalence of elementary objects
and subassemblies and resulted in compact grammatical rules describing
whole classes of assemblies.

Grammars together with a suitably defined lexicon comprising facts about
mechanical elements allow an automatic derivation of structural descriptions
of assembled objects. Employing techniques known from discourse theory, we
introduced a parsing model to recognize assembled objects in image data. Based
on a domain dependent heuristic, clusters of recognized elementary objects are
analyzed sequentially and structures of detected subassemblies are conjoined if
necessary. Thus, if all elementary objects comprising an assembly are visible, a
reliable recognition of assembled objects is possible, which allows the
intermediate steps of a construction process to be supervised.

Further work concerning grammar based assembly recognition must consider
the problem of perspective occlusions. That is, the parsing model must be extended
such that fuzzier input data, like incorrect or missing results of the elementary
object recognition procedure, do not affect the correct detection of assembled
objects. To this end we hope to benefit from ideas to deal with incorrect or
incomplete speech input [?].

Fig. 5. (a) An assembly with 4 bolts. (b) Results of a single object recognition
procedure and highlighted portpoints of the bar. (c) Recognition of a simple subassembly.
(d) Recognition of a second simple subassembly not connected to the first one. (e)
Recognition of a further subassembly sharing a part with the second one, thus they are
conjoined. (f) Detection of a subassembly sharing parts with the first and third one,
i.e. recognition of the whole assembly. (g) and (h) Another feasible assembly and the
recognition result.
Abstract. Context-Free Grammars are the object of increasing interest
in the pattern recognition research community in an attempt to overcome
the limited modeling capabilities of the simpler regular grammars, and
have application in a variety of fields such as language modeling, speech
recognition, optical character recognition, computational biology, etc.
This paper proposes an efficient algorithm to solve one of the problems
associated with the use of weighted and stochastic Context-Free Grammars:
the problem of computing the N best parse trees of a given string.
After the best parse tree has been computed using the CYK algorithm, a
large number of alternative parse trees are obtained, in order by weight
(or probability), in a small fraction of the time required by the CYK
algorithm to find the best parse tree. This is confirmed by experimental
results using grammars from two different domains: a chromosome
grammar, and a grammar modeling natural language sentences from the
Wall Street Journal corpus.

Keywords: Weighted Context-Free Grammars, Stochastic Context-Free 
Grammars, CYK Algorithm, N Best Parse Trees. 



1 Introduction 



Syntactic pattern recognition makes use of formal language theory to describe
the underlying structure of pattern classes, in applications where the relationships
between primitive elements are important [?]. Stochastic grammars are
used to model the fact that some structures and patterns are more frequent
than others. In this framework, stochastic Context-Free Grammars (CFGs)
are the object of increasing interest in the research community in an attempt to
overcome the limited modeling capabilities of the simpler regular grammars, even
though this implies using more costly training and parsing algorithms. This is the
case, for instance, of language modeling in speech recognition/understanding [?]
or RNA modeling in computational biology [?].

This paper proposes an efficient algorithm to solve one of the problems
associated with the use of weighted CFGs, formally stated in Section 2: the problem of
computing the N best parse trees of a given string, sorted by weight. A particular
case of this problem is the computation of the N most likely parse trees
when G is a stochastic CFG, which has proved useful for improved training of
stochastic CFGs [?] and could have as many applications as the use of the
N best decodings in current speech recognition systems (re-scoring using more
accurate models, improved acoustic training, etc.).

The proposed algorithm works for weighted CFGs in Chomsky Normal Form
(CNF), but this does not imply any loss of generality because any weighted
CFG can be automatically converted into this form [3]. The best parse tree is
computed by means of a well-known version [?] of the Cocke-Younger-Kasami
(CYK) algorithm [?] (sometimes called Viterbi-style parsing) described in Section 3.
Once the best parse tree has been computed, the N best parse trees can be
computed by the algorithm presented in Section 4. The experimental results, reported
in Section 5, show the practical efficiency of this algorithm.

2 Notation and Problem Formulation

Let $G = (V, \Sigma, S, P, w)$ be a weighted Context-Free Grammar in CNF [?],
where $V$ is a finite set of nonterminal symbols, $\Sigma$ is a finite set (disjoint from $V$)
of terminal symbols, $S \in V$ is the start symbol, $P$ is a finite set of productions
of the form $A \to \alpha$ with $A \in V$ and $\alpha \in (V \times V) \cup \Sigma$, and $w : P \to \mathbb{R}$ is a
function that assigns a weight to each production in $P$.

Given $G$ and given a string $x = x_1 x_2 \cdots x_{|x|} \in \Sigma^+$, where $|x|$ denotes the
length of $x$, let us define a set $\mathcal{T}$ of binary trees whose nodes are of the form
$A_{i:k}$ with $A \in V$ and $1 \le i \le k \le |x|$, and a weighting function $W : \mathcal{T} \to \mathbb{R}$, as
follows:

(i) If there is a production $A \to x_i$ in $P$ then the tree $(A_{i:i})$ with the single node
$A_{i:i}$ is in $\mathcal{T}$, and has weight $W((A_{i:i})) = w(A \to x_i)$.

(ii) If there is a tree $T_1$ with root $B_{i:j}$ in $\mathcal{T}$, a tree $T_2$ with root $C_{j+1:k}$ in $\mathcal{T}$, and
a production $A \to BC$ in $P$, then the tree $(A_{i:k}, T_1, T_2)$ with root $A_{i:k}$, left
subtree $T_1$, and right subtree $T_2$ is in $\mathcal{T}$, and has weight
$W((A_{i:k}, T_1, T_2)) = W(T_1) + W(T_2) + w(A \to BC)$.

A tree in $\mathcal{T}$ whose root is $A_{i:k}$ is a partial parse tree representing a derivation
of the substring $x_i \ldots x_k$ from the nonterminal $A$. A parse tree for $x$ according to
$G$ is a binary tree $T \in \mathcal{T}$ whose root is $S_{1:|x|}$. The best parse tree is the parse tree
of minimum weight. The N best parse trees are the N parse trees of minimum
total weight. The problem we study in this paper can then be formulated as:

Given a weighted Context-Free Grammar in CNF $G$, given a string $x \in \Sigma^+$
and given a positive integer $N$, find the N best parse trees for $x$ in
order by weight.

Let us denote by $T^n(A_{i:k})$ the n-th best tree among those in $\mathcal{T}$ that have root
$A_{i:k}$, and let $W^n(A_{i:k})$ be its weight. The problem is then finding
$T^1(S_{1:|x|}), T^2(S_{1:|x|}), \ldots, T^N(S_{1:|x|})$.

A particular case of this problem is the computation of the N most likely
parse trees for $x$ when $G$ is a stochastic Context-Free Grammar [?]. In this
case, a function $p : P \to \mathbb{R}$ assigns a probability to each production, verifying
$0 < p(A \to \alpha) \le 1$ for all $A \to \alpha \in P$, and $\sum_{\alpha : A \to \alpha \in P} p(A \to \alpha) = 1$ for
all $A \in V$. If $T$ is a parse tree for $x$, its probability is the product of the
probabilities of all the productions involved in its construction. If we assign to
the productions in $P$ a weight $w(A \to \alpha) = -\log p(A \to \alpha)$, then maximizing
products of probabilities becomes minimizing sums of weights, and the N parse
trees of maximum probability are the N parse trees of minimum weight.
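The correspondence between probabilities and weights can be checked numerically with a short Python sketch. The toy productions, the helper names and the use of natural logarithms are assumptions made only for this illustration.

```python
import math


def production_weights(probabilities):
    """Map each production's probability p to the weight -log(p)."""
    return {rule: -math.log(p) for rule, p in probabilities.items()}


def tree_probability(rules_used, probabilities):
    """Probability of a parse tree: product over the productions it uses."""
    return math.prod(probabilities[rule] for rule in rules_used)


def tree_weight(rules_used, weights):
    """Weight of a parse tree: sum over the productions it uses."""
    return sum(weights[rule] for rule in rules_used)


# Hypothetical stochastic grammar; probabilities with the same left-hand side sum to 1.
probs = {"S -> A B": 0.6, "S -> B A": 0.4, "A -> a": 1.0, "B -> b": 1.0}
weights = production_weights(probs)

likely = ["S -> A B", "A -> a", "B -> b"]
unlikely = ["S -> B A", "B -> b", "A -> a"]
# The more probable tree is the one with the smaller weight.
ordered_by_weight = sorted([likely, unlikely], key=lambda t: tree_weight(t, weights))
```

Exponentiating the negated weight of a tree recovers its probability, so both rankings coincide.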

A problem closely related to the computation of the N best parse trees is the
enumeration of parse trees until the best one satisfying some desired restriction
is obtained, without fixing a priori a value for N. The algorithm that we present
in Section 4 also solves this problem.

3 Computing the Best Parse Tree 

The CYK algorithm was initially proposed by Cocke, Younger, and Kasami to
solve the following problem: given a Context-Free Grammar $G$ in CNF (not necessarily
weighted) and a string $x \in \Sigma^+$, determine whether there is a parse tree
for $x$ according to $G$ [?]. The CYK algorithm can be easily modified to
compute the best parse tree when $G$ is weighted, on the basis of the following
recursive equations [?].

Recursive Equations. For every $A \in V$ and $1 \le i \le k \le |x|$ the best parse
tree with root $A_{i:k}$ is

$$T^1(A_{i:k}) = \begin{cases} \arg\min_{T \in \mathcal{T}^1(A_{i:k})} W(T), & \text{if } k > i, \\ (A_{i:i}), & \text{if } k = i \text{ and } A \to x_i \in P, \end{cases} \tag{1}$$

where, for $k > i$,

$$\mathcal{T}^1(A_{i:k}) = \{ (A_{i:k}, T^1(B_{i:j}), T^1(C_{j+1:k})) : A \to BC \in P,\ i \le j < k \} \tag{2}$$

denotes a set of candidates to be the best parse tree with root $A_{i:k}$. According to
the definition given in Section 2, the weight of a tree $T = (A_{i:k}, T^1(B_{i:j}), T^1(C_{j+1:k}))$
in $\mathcal{T}^1(A_{i:k})$ is $W(T) = W^1(B_{i:j}) + W^1(C_{j+1:k}) + w(A \to BC)$. The weight of the
best parse tree with root $A_{i:k}$ is

$$W^1(A_{i:k}) = \begin{cases} \min_{T \in \mathcal{T}^1(A_{i:k})} W(T), & \text{if } k > i, \\ w(A \to x_i), & \text{if } k = i \text{ and } A \to x_i \in P. \end{cases} \tag{3}$$

If $k > i$ and $\mathcal{T}^1(A_{i:k})$ is empty, or $k = i$ and $A \to x_i \notin P$, then $T^1(A_{i:k})$ does
not exist.

Algorithm CYK for weighted CFGs
for i := 1 to |x| do
    for all A → x_i in P do
        T^1(A_{i:i}) := (A_{i:i})
for l := 1 to |x| − 1 do
    for i := 1 to |x| − l do
        k := i + l
        for all A in V do
            T^1(A_{i:k}) := argmin_{T ∈ 𝒯^1(A_{i:k})} W(T)
return T^1(S_{1:|x|})

Fig. 1. CYK algorithm for weighted CFGs. Given a weighted grammar $(V, \Sigma, S, P, w)$
in CNF and a string $x$, the algorithm returns its best parse tree.



CYK Algorithm. The problem of computing the best parse tree for $x$ consists
then in solving equations (1) to (3) to find $T^1(S_{1:|x|})$. The CYK algorithm
(Fig. 1) is a dynamic programming algorithm that computes $T^1(A_{i:k})$ iteratively
for increasing values of the length $l = k - i$, thus guaranteeing that the right-hand
sides of the equations have been previously computed when they are going
to be used. Its running time is $O(|x|^3 |P|)$, and the required space is $O(|x|^2 |V|)$,
where $|P|$ is the number of productions and $|V|$ is the number of nonterminals
in $G$.
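A minimal Python sketch of this weighted CYK recursion is given below. It is an illustration rather than the authors' C implementation: the dictionary-based grammar encoding, the function name `cyk_best_parse`, the 0-based inclusive positions and the nested-tuple tree format are all assumptions of the sketch.

```python
def cyk_best_parse(x, unary, binary, start):
    """Return (weight, tree) of the minimum-weight parse of x, or None if none exists.

    unary:  {(A, a): w} for productions A -> a
    binary: {(A, B, C): w} for productions A -> B C
    Trees are nested tuples: (A, i, k) for leaves, (A, i, k, left, right) otherwise.
    """
    n = len(x)
    best = {}  # (A, i, k) -> (weight, tree); positions are 0-based and inclusive
    for i, symbol in enumerate(x):
        for (A, a), w in unary.items():
            if a == symbol:
                best[(A, i, i)] = (w, (A, i, i))
    for length in range(1, n):                 # length = k - i, increasing
        for i in range(n - length):
            k = i + length
            for (A, B, C), w in binary.items():
                for j in range(i, k):          # B spans i..j, C spans j+1..k
                    left = best.get((B, i, j))
                    right = best.get((C, j + 1, k))
                    if left is None or right is None:
                        continue
                    total = left[0] + right[0] + w
                    if (A, i, k) not in best or total < best[(A, i, k)][0]:
                        best[(A, i, k)] = (total, (A, i, k, left[1], right[1]))
    return best.get((start, 0, n - 1))


# Hypothetical toy grammar: S -> S S (1.0), S -> A S (0.5), S -> a (2.0), A -> a (1.0).
unary = {("S", "a"): 2.0, ("A", "a"): 1.0}
binary = {("S", "S", "S"): 1.0, ("S", "A", "S"): 0.5}
result = cyk_best_parse("aa", unary, binary, "S")
```

The three nested loops over span length, start position and split point, together with the loop over productions, reflect the $O(|x|^3 |P|)$ running time stated above.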



4 Computing the N Best Parse Trees 



Let us now consider how to calculate $T^n(A_{i:k})$ in general for $A \in V$, $1 \le i \le
k \le |x|$, and $1 \le n \le N$. In particular, $T^n(S_{1:|x|})$ for $1 \le n \le N$ will be
the solution to the problem we are addressing. Let us assume in what follows
that $|x| > 1$ (otherwise $T^2(S_{1:|x|})$ does not exist). We will first generalize the
recursive equations given in Section 3 to compute the N best parse trees, and then we
will propose an algorithm to solve the generalized equations.



Recursive Equations. Let us study which trees should be considered candidates
to $T^n(A_{i:k})$. Clearly, $T^n(A_{i:k})$ does not exist for $n > 1$ if $k = i$ (while $T^1(A_{i:k})$
will exist or not depending on whether there is a production $A \to x_i$ in $P$, as we
have seen in Section 3). Let us then examine the case $k > i$. It should also be clear that
in order to calculate $T^n(A_{i:k})$ we do not need to consider the trees of the form
$(A_{i:k}, T^p(B_{i:j}), T^q(C_{j+1:k}))$ with $p > n$ or $q > n$ (because there are at least $n$
trees among those with $p \le n$ and $q \le n$, with lower or equal weight). Therefore,
$T^n(A_{i:k})$ can be chosen as the best tree different from $T^1(A_{i:k}), \ldots, T^{n-1}(A_{i:k})$
in the set

$$\{ (A_{i:k}, T^p(B_{i:j}), T^q(C_{j+1:k})) : A \to BC \in P,\ i \le j < k,\ 1 \le p \le n,\ 1 \le q \le n \}.$$

Fig. 2. Schematic representation of the partial order among some candidate trees.



But we can do considerably better than computing the N best parse trees for every
possible root $A_{i:k}$, because there is a partial order defined among some elements
of this set of trees. Based on the relations (schematically represented in Fig. 2)

$$W^p(B_{i:j}) + W^q(C_{j+1:k}) \le W^{p+1}(B_{i:j}) + W^q(C_{j+1:k}), \tag{4}$$

$$W^p(B_{i:j}) + W^q(C_{j+1:k}) \le W^p(B_{i:j}) + W^{q+1}(C_{j+1:k}), \tag{5}$$

we can define a smaller set $\mathcal{T}^n(A_{i:k})$ of trees with root $A_{i:k}$ in such a way that
we still have the guarantee that $T^n(A_{i:k})$ is the best among them. Let $\mathcal{T}^1(A_{i:k})$
be the set of trees defined in (2). For $n > 1$, let us assume that
$T^{n-1}(A_{i:k}) = (A_{i:k}, T^p(B_{i:j}), T^q(C_{j+1:k}))$, and if $q = 1$, let

$$\mathcal{T}^n(A_{i:k}) = \left(\mathcal{T}^{n-1}(A_{i:k}) - \{T^{n-1}(A_{i:k})\}\right) \cup \{(A_{i:k}, T^p(B_{i:j}), T^{q+1}(C_{j+1:k}))\} \cup \{(A_{i:k}, T^{p+1}(B_{i:j}), T^q(C_{j+1:k}))\}; \tag{6}$$

otherwise (if $q > 1$), let

$$\mathcal{T}^n(A_{i:k}) = \left(\mathcal{T}^{n-1}(A_{i:k}) - \{T^{n-1}(A_{i:k})\}\right) \cup \{(A_{i:k}, T^p(B_{i:j}), T^{q+1}(C_{j+1:k}))\}, \tag{7}$$

assuming always that $\{(A_{i:k}, T_1, T_2)\}$ denotes the empty set if $T_1$ or $T_2$ does not
exist. Then we have

$$T^n(A_{i:k}) = \arg\min_{T \in \mathcal{T}^n(A_{i:k})} W(T), \tag{8}$$

$$W^n(A_{i:k}) = \min_{T \in \mathcal{T}^n(A_{i:k})} W(T), \tag{9}$$

if $\mathcal{T}^n(A_{i:k})$ is not empty (otherwise $T^n(A_{i:k})$ does not exist).
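The effect of relations (4) and (5) can be seen in isolation with a small Python sketch that fixes one production and one split point and enumerates pairs of ranks (p, q) in order of combined weight. The function name `ranked_pairs` and the list-based inputs are assumptions of this sketch; in the algorithm itself the candidates of all productions and split points share one set per node.

```python
import heapq


def ranked_pairs(left_weights, right_weights, n):
    """Yield up to n triples (weight, p, q) in nondecreasing weight order.

    left_weights[p - 1] and right_weights[q - 1] must be sorted ascending; they play
    the roles of W^p(B_{i:j}) and W^q(C_{j+1:k}) for a fixed production and split j.
    """
    if not left_weights or not right_weights:
        return
    heap = [(left_weights[0] + right_weights[0], 1, 1)]
    for _ in range(n):
        if not heap:
            return
        weight, p, q = heapq.heappop(heap)
        yield weight, p, q
        if q < len(right_weights):
            # Successor (p, q + 1): inserted in both (6) and (7).
            heapq.heappush(heap, (left_weights[p - 1] + right_weights[q], p, q + 1))
        if q == 1 and p < len(left_weights):
            # Successor (p + 1, q): inserted only when q = 1, as in (6).
            heapq.heappush(heap, (left_weights[p] + right_weights[0], p + 1, 1))


first_four = list(ranked_pairs([1, 4], [2, 3, 10], 4))
```

Because the successor (p + 1, q) is generated only when q = 1, every pair is reached through exactly one chain of insertions, so no candidate enters the queue twice.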



Recursive Enumeration of the N Best Parse Trees. The problem of computing
the N best parse trees consists then in solving equations (6) to (9) to
find $T^1(S_{1:|x|}), T^2(S_{1:|x|}), \ldots, T^N(S_{1:|x|})$. The algorithm in Fig. 3 solves them
for increasing values of $n$, after the best parse tree has been computed by the
CYK algorithm, recursively starting from the node $S_{1:|x|}$.

The algorithm makes use of the recursive procedure NextTree. For $n > 1$
and $k > i$, and once $T^{n-1}(A_{i:k}) = (A_{i:k}, T^p(B_{i:j}), T^q(C_{j+1:k}))$ is available,
NextTree($(A_{i:k}, T^p(B_{i:j}), T^q(C_{j+1:k}))$, $n$) computes $T^n(A_{i:k})$ according to
equation (8). First, it builds $\mathcal{T}^n(A_{i:k})$ from $\mathcal{T}^{n-1}(A_{i:k})$ according to
equations (6) and (7). This may require inserting in this set at most two new candidate trees:
$(A_{i:k}, T^{p+1}(B_{i:j}), T^q(C_{j+1:k}))$ and $(A_{i:k}, T^p(B_{i:j}), T^{q+1}(C_{j+1:k}))$. If $T^{p+1}(B_{i:j})$
(or $T^{q+1}(C_{j+1:k})$) is required and has not been computed before, it is computed
by calling NextTree($T^p(B_{i:j})$, $p + 1$) (or NextTree($T^q(C_{j+1:k})$, $q + 1$)).

A1: Algorithm Recursive enumeration of the N best parse trees
A2: Compute T^1(A_{i:k}) for all A ∈ V, 1 ≤ i ≤ k ≤ |x| using the CYK algorithm
A3: for n := 2 to N do NextTree(T^{n-1}(S_{1:|x|}), n)
A4: return {T^1(S_{1:|x|}), ..., T^N(S_{1:|x|})}

B1: procedure NextTree((A_{i:k}, T^p(B_{i:j}), T^q(C_{j+1:k})), n)
B2: if n = 2 then
B3:     𝒯[A_{i:k}] := {(A_{i:k}, T^1(B_{i:j}), T^1(C_{j+1:k})) : A → BC ∈ P, i ≤ j < k} − {T^1(A_{i:k})}
B4: if q = 1, j > i, and T^{p+1}(B_{i:j}) has not been computed then
B5:     NextTree(T^p(B_{i:j}), p + 1)
B6: if q = 1 and T^{p+1}(B_{i:j}) exists then
B7:     𝒯[A_{i:k}] := 𝒯[A_{i:k}] ∪ {(A_{i:k}, T^{p+1}(B_{i:j}), T^q(C_{j+1:k}))}
B8: if k > j + 1 and T^{q+1}(C_{j+1:k}) has not been computed then
B9:     NextTree(T^q(C_{j+1:k}), q + 1)
B10: if T^{q+1}(C_{j+1:k}) exists then
B11:     𝒯[A_{i:k}] := 𝒯[A_{i:k}] ∪ {(A_{i:k}, T^p(B_{i:j}), T^{q+1}(C_{j+1:k}))}
B12: if 𝒯[A_{i:k}] ≠ ∅ then
B13:     T^n(A_{i:k}) := argmin_{T ∈ 𝒯[A_{i:k}]} W(T)
B14:     𝒯[A_{i:k}] := 𝒯[A_{i:k}] − {T^n(A_{i:k})}
B15: else
B16:     T^n(A_{i:k}) does not exist

Fig. 3. Algorithm to compute the N best parse trees.

Both $\mathcal{T}^{n-1}(A_{i:k})$ and $\mathcal{T}^n(A_{i:k})$ can be implemented by the same structure
$\mathcal{T}[A_{i:k}]$, because once $\mathcal{T}^n(A_{i:k})$ has been calculated, $\mathcal{T}^{n-1}(A_{i:k})$ is no longer
necessary. On the other hand, the set $\mathcal{T}[A_{i:k}]$ is initialized only when the second
best parse tree with root $A_{i:k}$ is required, because there could be nodes $A_{i:k}$ for
which it is not necessary to compute alternative trees.

Finiteness of the recursion is guaranteed by the fact that the first arguments
of the recursive calls are trees with root $A_{i:k}$ for decreasing values of $k - i$, and
that the calls are only performed if $k - i > 0$, so that the number of recursive calls produced
by NextTree($T^{n-1}(S_{1:|x|})$, $n$) to compute $T^n(S_{1:|x|})$ is at most $|x|$.
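The procedure of Fig. 3 can be sketched in Python as follows. This is a hedged illustration, not the authors' C implementation: the grammar encoding, the function name `n_best_parses`, the 0-based inclusive positions, the use of `heapq` as priority queue and the explicit `exhausted` set (which records that a requested tree has been found not to exist) are assumptions of the sketch. Nonterminals are assumed to be strings so that heap entries with equal weight remain comparable.

```python
import heapq


def n_best_parses(x, unary, binary, start, N):
    """Return up to N (weight, tree) pairs for x, in nondecreasing weight order.

    unary:  {(A, a): w} for productions A -> a
    binary: {(A, B, C): w} for productions A -> B C
    """
    size = len(x)
    # trees[(A, i, k)][r - 1] = (weight, back) describes the r-th best tree;
    # back is None for a leaf, otherwise (B, C, j, p, q).
    trees = {}
    queues = {}        # node -> heap of candidates (weight, B, C, j, p, q)
    exhausted = set()  # nodes known to have no further tree

    def candidates(A, i, k):
        """Candidate set of equation (2): best subtrees under every rule and split."""
        for (lhs, B, C), w in binary.items():
            if lhs == A:
                for j in range(i, k):
                    if (B, i, j) in trees and (C, j + 1, k) in trees:
                        total = trees[(B, i, j)][0][0] + trees[(C, j + 1, k)][0][0] + w
                        yield (total, B, C, j, 1, 1)

    # Step A2: CYK fills in the best tree of every node.
    for i, symbol in enumerate(x):
        for (A, a), w in unary.items():
            if a == symbol:
                trees[(A, i, i)] = [(w, None)]
    for length in range(1, size):
        for i in range(size - length):
            k = i + length
            for A in {lhs for (lhs, _, _) in binary}:
                best = min(candidates(A, i, k), default=None)
                if best is not None:
                    trees[(A, i, k)] = [(best[0], best[1:])]

    def exists(node, r):
        return len(trees.get(node, ())) >= r

    def push(node, B, C, j, p, q):
        A, i, k = node
        total = trees[(B, i, j)][p - 1][0] + trees[(C, j + 1, k)][q - 1][0] + binary[(A, B, C)]
        heapq.heappush(queues[node], (total, B, C, j, p, q))

    def next_tree(node, n):
        """Steps B1 to B16: compute the n-th best tree of node from the (n-1)-th."""
        A, i, k = node
        B, C, j, p, q = trees[node][n - 2][1]
        left, right = (B, i, j), (C, j + 1, k)
        if n == 2:                                                     # B2-B3
            queues[node] = [c for c in candidates(A, i, k) if c[1:4] != (B, C, j)]
            heapq.heapify(queues[node])
        if q == 1 and j > i and not exists(left, p + 1) and left not in exhausted:
            next_tree(left, p + 1)                                     # B4-B5
        if q == 1 and exists(left, p + 1):                             # B6-B7
            push(node, B, C, j, p + 1, q)
        if k > j + 1 and not exists(right, q + 1) and right not in exhausted:
            next_tree(right, q + 1)                                    # B8-B9
        if exists(right, q + 1):                                       # B10-B11
            push(node, B, C, j, p, q + 1)
        if queues[node]:                                               # B12-B14
            best = heapq.heappop(queues[node])
            trees[node].append((best[0], best[1:]))
        else:                                                          # B15-B16
            exhausted.add(node)

    def build(node, r):
        """Expand the r-th best tree of node into nested tuples."""
        A, i, k = node
        back = trees[node][r - 1][1]
        if back is None:
            return (A, x[i])
        B, C, j, p, q = back
        return (A, build((B, i, j), p), build((C, j + 1, k), q))

    root = (start, 0, size - 1)
    if root not in trees:
        return []
    for n in range(2, N + 1):                                          # step A3
        if size < 2:
            break
        next_tree(root, n)
        if not exists(root, n):
            break
    return [(trees[root][r - 1][0], build(root, r)) for r in range(1, len(trees[root]) + 1)]
```

As a usage example, with the single rules S → S S and S → a (both of weight 1.0, hypothetical values), `n_best_parses("aaaa", {("S", "a"): 1.0}, {("S", "S", "S"): 1.0}, "S", 10)` enumerates the five binary bracketings of the string and then stops, since no further parse tree exists.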



Data Structures and Implementation Issues. The trees $T^1(A_{i:k})$, $T^2(A_{i:k})$,
$T^3(A_{i:k})$, ... can be stored ordered by weight in a linked list associated to
node $A_{i:k}$. Every tree of the form $(A_{i:k}, T^p(B_{i:j}), T^q(C_{j+1:k}))$ can be efficiently
represented in memory by just three values: its weight, a pointer to its left subtree
$T^p(B_{i:j})$, and a pointer to its right subtree $T^q(C_{j+1:k})$. In this way, $T^n(A_{i:k})$ can
be inserted, in step B13, in constant time following $T^{n-1}(A_{i:k})$ in the list of trees
associated with node $A_{i:k}$, and steps B4, B6, B8, and B10 take constant time to
check whether $T^{p+1}(B_{i:j})$ or $T^{q+1}(C_{j+1:k})$ are available.

The only operations performed by the algorithm with the sets of candidates
$\mathcal{T}[A_{i:k}]$ are those supported by priority queues: insertion of new elements and
selection/deletion of the best element. Several data structures allow these
operations to be performed in time logarithmic with respect to the number of elements in
the priority queue [7]. The results reported in Section 5 correspond to an
implementation using leftist trees [7].

An additional improvement in the algorithm is possible: of all the candidate
trees with the same value of $j$, only the best one needs to be inserted in $\mathcal{T}[A_{i:k}]$
when it is initialized in step B3. The rest of the candidates with that value of $j$ only
need to be inserted if the best one is extracted (after step B14).



Computational Complexity. The CYK algorithm runs in time $O(|x|^3 |P|)$.
The number of different sets $\mathcal{T}[A_{i:k}]$ is $O(|x|^2 |V|)$ and, in the worst case, all
of them are initialized by step B3 (in different calls to NextTree) in total time
$O(|x|^3 |P|)$, because each initialization can be performed in linear time with
respect to the size of the set.

The computation of the N best parse trees requires at most $N|x|$ calls to
NextTree. Each call may require inserting at most two new elements in a set
of candidates (steps B7 and B11), and selecting and deleting the best candidate
(steps B13 and B14) from it. Since no more than N trees with root $A_{i:k}$ may need
to be computed, the size of $\mathcal{T}[A_{i:k}]$ is bounded by its initial size plus N. Thus,
the total time required by the whole algorithm to compute the N best parse
trees is $O(|x|^3 |P| + N|x| \log(|x||P| + N))$.

On the other hand, the space complexity of the algorithm is $O(|x|^3 |P| + N|x|)$.

This computational complexity analysis is based on worst case assumptions
that could be too pessimistic. In practice, it can be expected that even for large
values of N, not all the sets of candidates are initialized and the number of
recursive calls can be much lower than $N|x|$.

5 Experimental Results 

In order to assess the behavior of the algorithm in practice, we have performed
experiments with strings and grammars corresponding to two different domains.
All the experiments have been run on a 400 MHz Pentium-II computer running
under Linux 2.2. The algorithm has been implemented in C and compiled with
gcc 2.91 using the optimization level '-O2'.



Chromosome Grammar. Strings of different lengths have been randomly generated
using the chromosome grammar described in [?, §2.3.1] and assuming
that all the productions with the same left-hand symbol have the same probability.
The grammar in CNF has 9 nonterminals, 5 terminals, and 20 productions.
Fig. 4 shows the observed dependency of the running time of the algorithm in
milliseconds, averaged over 20 strings of each length, as a function of N (for
different string lengths) and as a function of the string length (for different values
of N).

Fig. 4. Dependency of running time with N and |x| using the chromosome grammar.

Two of the plots in Fig. 4 show the average time (± two times the standard deviation)
required to compute up to 100 and up to 1000 best parse trees, respectively.
A further plot depicts the dependency of time on the string length for different values
of N. Time for N = 1 corresponds to the CYK algorithm. It can be clearly
observed that once the best parse tree has been found by the CYK algorithm,
the rest of the N best parse trees are computed very efficiently, in just a small
fraction of the total running time. This is made more explicit in Fig. 5, where
the time to compute from the second to the N-th best parse tree is shown as a
percentage of the time required by the CYK algorithm to find the optimal parse
tree.



Wall Street Journal. The algorithm has also been tested using sentences from
the Wall Street Journal (WSJ) corpus, as annotated in the Penn Treebank [?].
A grammar with 42 terminals (part of speech tags), 14 nonterminals, and 518
productions has been obtained using the Inside-Outside algorithm with the
sentences shorter than 16 words in sections 00-19 of this corpus. Sentences from
sections 20-24 have been used to measure the running time of the algorithm to
compute the N best parse trees with this grammar. The results, shown in Fig. 6
(also averaged over 20 sentences of each length), are similar to those obtained
with the chromosome grammar.



6 Conclusions 



In this paper, a new algorithm to compute the N best parse trees for weighted 
(or stochastic) Context-Free Grammars has been presented. The experimental 
results with two different pattern recognition tasks have shown the practical 
efficiency of this algorithm, which can be used to compute a large number of 
parse trees, in order by weight, in a small fraction of the time required by the 
CYK algorithm to compute the best one. Since there is no need to fix a priori 
the value of N, the algorithm can be used to enumerate parse trees until the 
best one satisfying some desired constraint is obtained. 
Abstract. This paper addresses the problem of structural clustering of
string patterns. Adopting the grammar formalism for representing both
individual sequences and sets of patterns, a partitional clustering algorithm
is proposed. The performance of the new algorithm, taking as reference
the corresponding hierarchical version, is analyzed in terms of computational
complexity and data partitioning results. The new algorithm
introduces great improvements in terms of computational efficiency, as
demonstrated by theoretical analysis. Unlike the hierarchical approach,
clustering results are dependent on the order of patterns' presentation,
which may lead to performance degradation. This effect, however, is overcome
by adopting a resampling technique. Empirical evaluation of the
methods is performed through application examples, by matching clusters
between pairs of partitions and determining an index of clusters
agreement.



1 Introduction 



A diversity of clustering procedures can be found in the literature [?]. From the
methodological point of view, algorithms can be divided into two major classes:
partitional methods and hierarchical methods. Partitional structure organizes
patterns into a small number of clusters. It usually assumes the a priori
specification of the number of clusters to partition the data or the definition of
cluster validity criteria. Hierarchical clustering consists of a sequence of nested
data partitions in a hierarchical structure. A particular partition is obtained by
cutting the hierarchical structure at some level.

Concerning structural patterns represented as sequences of symbols, clustering
algorithms are extensions of these methods that adopt adequate string
similarity measures [?]. Viewing similarity computation as a matching process,
references [?] and [?] present sentence-to-sentence clustering procedures
based on the comparison of a candidate string with sentences in previously
formed clusters (clustering based on a nearest-neighbor rule) or with cluster
center strings (cluster center technique), respectively. String editing operations are
used there in the transformation of strings to perform the matching. Following
the string matching paradigm while modeling clusters' structure using grammars,
error-correcting parsing and grammatical inference are combined in a clustering
algorithm described in [?]. Basically it implements a nearest-neighbor
rule, where sentences are compared, not directly with the patterns included in
previously formed clusters, but with the best matching elements in the languages
generated by the grammars inferred from the clusters' data. Also using grammars
for modeling clusters' structure, a distinct approach, based on the concept
of minimum description, is proposed in [?]. Structural resemblance between patterns
is assumed to reflect common rules of composition; a normalized reduction
in grammar complexity obtained by associating patterns gives the measure of
similarity underlying the hierarchical clustering algorithm proposed there. In [?],
the search for common subpatterns by means of Solomonoff's coding [?] forms
the basis of a clustering algorithm that defines similarity between patterns as a
ratio of decrease in code length.

This paper focuses on clustering procedures capturing structural resemblance
in the form of rules of composition between primitives, as opposed to string matching
techniques. The grammar formalism is adopted to describe these rules and
simultaneously provide a model for cluster representation. To this purpose, a new
clustering algorithm, of the partitional type, is proposed and compared with the
hierarchical method described in [?]. Also, within the scope of empirical comparison
of partitions produced by the different algorithms, a global partitioning
agreement index is proposed.

Section 2 presents the grammar-based similarity measure that forms the core
of the clustering algorithms, emphasizing the distinction between structural
resemblance and string matching. The new clustering algorithm is described next,
followed by theoretical algorithmic complexity evaluations and comparative
performance analysis. A measure of partitions agreement
is proposed for empirical assessment of the methods. Application examples are
then presented.



2 Structural Similarity Measure 



The concept of resemblance between strings has typically been viewed from
two perspectives [?]: (1) similarity as matching, based on string editing operations;
(2) structural resemblance, based on the similarity of their composition
rules and primitives.

The similarity measure described next, which forms the basis of the clustering
algorithms described later in the paper, falls into the second category. According to
this approach, patterns' structure is modeled by syntactic rules, automatically
inferred from the data [?]. Grammar complexity gives a measure of the
compactness of this representation:



$$C(G) = \sum_{i=1}^{r} \sum_{j=1}^{l_i} C(\alpha_{ij}) \qquad (1)$$

where $C(G)$ is the complexity of grammar $G$, $\alpha_{ij}$ represents the right
side of the $j$th production for the $i$th non-terminal symbol of the grammar
(with $r$ non-terminal symbols and $l_i$ productions for the $i$th one), and

$$C(\alpha) = (n+1)\log(n+1) - \sum_{i=1}^{m} k_i \log k_i \qquad (2)$$

with $k_i$ being the number of times that the symbol $a_i$ appears in $\alpha$, and
$n$ the length of the grammatical sentence $\alpha$. Structural resemblance is
captured by shared rules of composition, which lead to a reduction of the global
description (grammar inferred from the ensemble of patterns) when compared to
the descriptions of the patterns considered individually. Similarity is then
defined as the ratio of decrease in grammar complexity (RDGC), as follows:



$$RDGC(s_1, s_2) = \frac{C(G_{s_1}) + C(G_{s_2}) - C(G_{s_1,s_2})}{\min\{C(G_{s_1}), C(G_{s_2})\}} \qquad (3)$$

where $s_1, s_2$ are strings and $C(G_{s_i})$ denotes grammar complexity. Figure 1
outlines the RDGC similarity computation procedure.




Fig. 1. Computing the similarity between strings under the minimum grammatical 
complexity approach. Separate grammars are inferred from the individual strings and 
from the strings ensemble; similarity is defined as the normalized reduction in gram- 
matical complexity obtained by joining the patterns. The choice of the grammatical 
inference algorithm influences the similarity index obtained. For simplicity of illust- 
ration, graphs representing regular grammars are depicted next to the grammatical 
inference blocks. 
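
The computation outlined in Fig. 1 can be sketched in a few lines, under two assumptions that are not in the paper: the grammars are already inferred and passed in as plain dictionaries (no inference algorithm is implemented here), and the logarithm in expression (2) is taken in base 2. The names `sentence_complexity`, `grammar_complexity` and `rdgc`, and the toy grammars, are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def sentence_complexity(alpha):
    """Expression (2): C(alpha) = (n+1) log(n+1) - sum_i k_i log k_i.

    Base-2 logarithms are an assumption of this sketch.
    """
    n = len(alpha)
    counts = Counter(alpha)  # k_i: occurrences of each distinct symbol
    return (n + 1) * math.log2(n + 1) - sum(
        k * math.log2(k) for k in counts.values()
    )


def grammar_complexity(grammar):
    """Expression (1): sum of C over every right-hand side of every rule.

    `grammar` maps each non-terminal to a list of right-hand sides
    (an assumed representation, chosen only for this sketch).
    """
    return sum(
        sentence_complexity(rhs) for rhss in grammar.values() for rhs in rhss
    )


def rdgc(g1, g2, g_joint):
    """Expression (3): ratio of decrease in grammar complexity."""
    c1, c2 = grammar_complexity(g1), grammar_complexity(g2)
    return (c1 + c2 - grammar_complexity(g_joint)) / min(c1, c2)


# Hand-written toy grammars (not produced by any inference algorithm).
g_a = {"S": ["aS", "a"]}
g_b = {"T": ["bT", "b"]}
same = rdgc(g_a, g_a, g_a)                 # joint grammar adds no new rule
disjoint = rdgc(g_a, g_b, {**g_a, **g_b})  # joint grammar shares no rule
```

In this toy setting the index is at its maximum when the joint grammar is no larger than either individual grammar, and vanishes when no production is shared, which mirrors the behavior the text attributes to the RDGC.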
Fig. 2. Graphical representation of strings of the form (2*6*)* (patterns 1 to 4) and
(0*6)* (patterns 5 to 8). The graphical interpretations of the symbols are as follows: 0 
- maintain the line direction; 2 - turn right; 6 - turn left. The string’s lengths for the 
several patterns are: (l)-9; (2)-9; (3)-389; (4)-409 (5)-18; (6)-9; (7)-487; (8)-411. 



The emphasis of the RDGC on structural similarity rather than on string
alignment is evidenced by the example depicted in figure 2, which represents
a graphical description of instances, with various lengths, of strings of the
form (2*6*)* or (0*6)*, with the symbol * indicating an arbitrary repetition of
the element on its left (parentheses are used for delimiting elements with more
than one character).
Table 1. Similarity matrix for the string patterns in figure 2.



Table 1 shows the similarity matrix between the string patterns, computed
using Crespi-Reghizzi's algorithm for grammatical inference, without
imposing a priori information other than left-to-right precedence of the
characters. As shown, the similarity measure provides complete separation of the
two structures (zero-valued blocks in the matrix). Furthermore, the similarity
index is independent of string length, yielding maximum similarity for most of
the patterns exhibiting the same structure. The non-unit values of the
similarity between string 4 and the first three strings (or between string 5 and
strings 6 to 8) reflect the sensitivity and asymmetry of the inference algorithm
to initial and terminal string values with the a priori information used.

3 Minimum Grammar Complexity Clustering 



The underlying idea of the clustering algorithms described next is that, if
sequences exhibit a similar structure, then their joint description will be more
compact than the combination of the descriptions of the individual elements.
Using the grammar formalism to model the string patterns, similarity of
structures and primitives leads to shared sets of grammar productions, and
hence to a reduction in the global grammar complexity. Taking the grammar
complexity, as defined in expressions (1) and (2), as a measure of description
compactness, the associated similarity between string patterns, defined by
expression (3), is extended to sets of sequences, providing a similarity
measure between clusters:



$$RDGC(C_1, C_2) = \frac{C(G_{C_1}) + C(G_{C_2}) - C(G_{C_1,C_2})}{\min\{C(G_{C_1}), C(G_{C_2})\}} \qquad (4)$$

with $G_{C_i}$ representing the grammar inferred from the data in cluster $C_i$.

Section 3.1 presents a hierarchical clustering algorithm based on this
similarity concept, proposed in earlier work. The new algorithm, a sentence to
sentence clustering procedure, is presented in section 3.2. These algorithms,
besides the data partitioning, provide a model for cluster representation.



3.1 Hierarchical Clustering 

Input: A set of strings $S = \{s_1, s_2, \ldots, s_n\}$ and a threshold $th$.

Output: A partition of $S$ into $m$ clusters $C_1, C_2, \ldots, C_m$ and their
grammatical representations $G_{C_1}, G_{C_2}, \ldots, G_{C_m}$.

Steps:

1. Assign $s_i$ to $C_i$, $i = 1, \ldots, n$, and infer a grammar, $G_{C_i}$, for
   each cluster. Let $m = n$.

2. Among the possible associations of two clusters, compute

   $$sim = \max \{RDGC(C_i, C_j)\}, \quad i, j = 1, \ldots, m, \; i \neq j$$

   If $sim > th$, then associate these clusters, set their grammatical description
   as the grammar inferred from the joint set of data, $G_{C_i,C_j}$, decrease $m$
   by one, and repeat step 2; otherwise stop, returning the clusters found.
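
The steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: grammatical inference is replaced by a deliberately crude stand-in (the set of adjacent symbol pairs of the strings, with set size playing the role of grammar complexity), and the similarity matrix is recomputed at every pass instead of being maintained incrementally. The names `toy_description`, `toy_similarity` and `hierarchical_clustering` are hypothetical.

```python
from itertools import combinations


def toy_description(strings):
    """Stand-in for grammar inference: the set of adjacent symbol pairs."""
    return {s[i:i + 2] for s in strings for i in range(len(s) - 1)}


def toy_similarity(cluster_a, cluster_b):
    """Same shape as expression (4), with set size as the 'complexity'."""
    ca = len(toy_description(cluster_a))
    cb = len(toy_description(cluster_b))
    c_joint = len(toy_description(cluster_a + cluster_b))
    smallest = min(ca, cb)
    return (ca + cb - c_joint) / smallest if smallest else 0.0


def hierarchical_clustering(strings, th, similarity=toy_similarity):
    clusters = [[s] for s in strings]            # step 1: one cluster per string
    while len(clusters) > 1:                     # step 2, repeated
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]),
        )
        if similarity(clusters[i], clusters[j]) <= th:
            break                                # no pair is similar enough
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters


result = hierarchical_clustering(["2626", "262626", "0606", "060606"], th=0.5)
```

With the toy similarity, strings built from the same symbol pairs merge regardless of their length, while strings with no pair in common stay apart, which is the qualitative behavior the threshold is meant to control.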



3.2 Sentence to Sentence Clustering Procedure 



Input: A set of strings $S = \{s_1, s_2, \ldots, s_n\}$ and a threshold $th$.

Output: A partition of $S$ into $m$ clusters $C_1, C_2, \ldots, C_m$ and their
grammatical representations $G_{C_1}, G_{C_2}, \ldots, G_{C_m}$.

Steps:

1. Assign $s_1$, the first string in $S$, to $C_1$, and infer the grammar $G_{C_1}$.
   Let $m = 1$.

2. For each remaining element $s_i \in S$ do:

   - Infer a grammar for $s_i$ and compute the similarity $RDGC(s_i, C_k)$,
     $k = 1, \ldots, m$. Let $sim$ be the highest value found and $C_j$ the
     matching cluster.
   - If $sim > th$, include $s_i$ in $C_j$ and update the cluster's grammatical
     description; otherwise, form a new cluster with $s_i$ and set $m = m + 1$.

3. Return the clusters found and their grammatical descriptions. 
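
A minimal sketch of this single-pass procedure follows. As an assumption made only for illustration, grammatical inference is again replaced by a crude stand-in (sets of adjacent symbol pairs, with set size as the complexity); the function names are hypothetical.

```python
def toy_description(strings):
    """Stand-in for grammar inference: the set of adjacent symbol pairs."""
    return {s[i:i + 2] for s in strings for i in range(len(s) - 1)}


def toy_similarity(cluster_a, cluster_b):
    """Same shape as expression (4), with set size as the 'complexity'."""
    ca = len(toy_description(cluster_a))
    cb = len(toy_description(cluster_b))
    c_joint = len(toy_description(cluster_a + cluster_b))
    smallest = min(ca, cb)
    return (ca + cb - c_joint) / smallest if smallest else 0.0


def sentence_to_sentence(strings, th, similarity=toy_similarity):
    clusters = [[strings[0]]]                    # step 1: first string seeds C_1
    for s in strings[1:]:                        # step 2: one pass over the rest
        sims = [similarity([s], c) for c in clusters]
        j = max(range(len(clusters)), key=sims.__getitem__)
        if sims[j] > th:
            clusters[j].append(s)                # join the best matching cluster
        else:
            clusters.append([s])                 # open a new cluster
    return clusters                              # step 3


result = sentence_to_sentence(["2626", "0606", "262626", "060606"], th=0.5)
```

Each string is compared only with the clusters that exist when it is presented, so the outcome can in general depend on the order of the input list.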

4 Comparison of the Algorithms 

The hierarchical and partitional algorithms described previously are based on
a common similarity measure and a common cluster representation paradigm: the
syntactic model. They therefore provide compact models for cluster description
that, while capturing data structure, also provide a mechanism for the
recognition of new data by means of parsing algorithms, as well as generative
capability.

The two approaches differ in computational efficiency, in the optimality of
the solutions found, and in the way they expose structure in the data.

4.1 Computational Complexity 



Let $n$ be the total number of samples. The computational analysis takes
grammatical inference as the elementary operation, as this is the most expensive
processing performed. It should be emphasized that the adopted grammatical
inference method (Crespi-Reghizzi's method) has linear time complexity in the
length of the patterns, $O(l)$. By adequate memorization of grammar structure
information (profiles), merging of clusters involves simple calculations on
these structures, without requiring recalculations with the samples.

Step 1 of the hierarchical algorithm (see section 3.1) involves $n$ distinct
grammar inferences, one per sample. The first iteration of step 2 performs
$\binom{n}{2}$ inferences, filling an $n \times n$ similarity matrix (this matrix
is upper triangular with unitary diagonal). The remaining iterations of step 2
undertake comparisons on this matrix of decreasing order (each association of
patterns, or merging of clusters, dictates a reduction by one in the matrix
dimension and its corresponding update, involving the computation of the
similarity of the merged cluster with all the others: $m - 1$ computations, $m$
being the current dimension of the similarity matrix). Overall, the
hierarchical algorithm has $O(n^2)$ time and space complexities.

The analysis of the steps of the partitional algorithm (section 3.2) shows that
$n$ inferences are needed (one per sample) and that the computation of the
similarity with existing clusters and the update of the clusters' grammars
involve at most $m$ grammar merging operations per sample. As a result, the
algorithm has $O(mn)$ time complexity and $O(m)$ space complexity. This
represents a significant reduction in computational complexity in comparison
with the hierarchical version, as $m$, the total number of clusters, is usually
small in relation to the number of samples.
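
As a rough worked count, the two bounds can be written out explicitly. This is a sketch under assumptions not stated in the paper: every inference or similarity evaluation is counted as one elementary operation, and in the worst case the hierarchical algorithm keeps merging until a single cluster remains, the matrix dimension after each merge being $m = n-1, n-2, \ldots, 1$. The symbols $N_{hier}$ and $N_{part}$ are introduced here only for illustration.

```latex
\begin{align*}
N_{hier} &\le n + \binom{n}{2} + \sum_{m=1}^{n-1} (m-1)
          = n + \frac{n(n-1)}{2} + \frac{(n-1)(n-2)}{2}
          = n^2 - n + 1 = O(n^2), \\
N_{part} &\le n + n\,m = n(m+1) = O(mn).
\end{align*}
```

For instance, with $n = 84$ samples and $m = 4$ clusters, these bounds give at most 6973 operations for the hierarchical algorithm against at most 420 for the partitional one.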
4.2 Optimality of the Solutions Found



Since the hierarchical data structuring produced by the hierarchical algorithm
is based on the pre-computation of a similarity matrix between all sample
pairs, the solutions found depend only on the value of th, a design parameter
representing the minimum similarity of patterns within a cluster. The
partitional algorithm, however, may produce solutions that depend on the order
of presentation of the patterns, in which case over-fragmentation of the data
may occur. This situation is more likely to arise when clusters are not well
separated and highly dissimilar patterns belonging to the same cluster, with
mutual similarity smaller than the threshold th, are at the top of the
presentation list. This dependency on the order of presentation may be overcome
by a combined resampling and consistent clusters gathering algorithm, not
described here due to space restrictions.

Empirical evaluation of the algorithms requires comparing the partitions
produced, which, in general, will include differing numbers of clusters and
different cluster organization and ordering. In order to evaluate the
consistency of two data partitions, or to compare the results of two clustering
algorithms taking an ideal partitioning as reference, it is necessary to
determine the correspondence between the clusters of both partitions. In other
words, one needs to determine the best matching associations of clusters and an
index of cluster agreement.

In the following, we define pc_idx, the partitions consistency index, as the
fraction of shared samples in matching clusters in two data partitions, over the
total number of samples:

$$pc\_idx = \frac{1}{n} \sum_{i=1}^{\min\{nc_1, nc_2\}} n\_shared_i$$

where $nc_1, nc_2$ are the numbers of clusters in the first and second partitions,
respectively, and $n\_shared_i$ is the number of samples shared by the $i$th
matching clusters. An algorithm for the computation of matching clusters and the
corresponding consistency index is described elsewhere.
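
Since the matching algorithm itself is not given here, the following sketch substitutes a simple greedy matching (repeatedly pairing the two unmatched clusters with the largest overlap). That choice is an assumption of this example and may differ from the authors' procedure; the function name `pc_idx` merely mirrors the index name.

```python
def pc_idx(partition_a, partition_b):
    """Fraction of samples shared by matched clusters of two partitions.

    Clusters are matched greedily by decreasing overlap (an assumed
    stand-in for the matching algorithm, which the text does not give).
    """
    sets_a = [set(c) for c in partition_a]
    sets_b = [set(c) for c in partition_b]
    n = sum(len(c) for c in sets_a)              # total number of samples
    overlaps = sorted(
        ((len(a & b), i, j)
         for i, a in enumerate(sets_a)
         for j, b in enumerate(sets_b)),
        reverse=True,
    )
    used_a, used_b, shared = set(), set(), 0
    for size, i, j in overlaps:
        if i not in used_a and j not in used_b:  # each cluster matched once
            used_a.add(i)
            used_b.add(j)
            shared += size
    return shared / n


ideal = [[1, 2, 3, 4], [5, 6, 7, 8]]
fragmented = [[1, 2, 3, 4], [5, 6], [7, 8]]
index = pc_idx(ideal, fragmented)
```

In the usage example the second partition never mixes samples from different ideal clusters, yet the index drops below 1 because one cluster is split in two: only min{nc_1, nc_2} pairs of clusters can be matched.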

5 Application Examples 

The first example consists of the patterns presented in section 2. Figure 3
illustrates the failure of string matching techniques in identifying the
structure of these patterns (dendrogram on the left); by applying the
hierarchical algorithm of section 3.1 (dendrogram on the right), perfect
separation is obtained for threshold values, th, smaller than 0.7. With the same
range of values for th, the partitional algorithm, described in section 3.2,
consistently produces the same two classes, regardless of the order of pattern
presentation.

The other example, depicted in figure 4, concerns the clustering of 84 contour
images of two types of hardware tools, using string descriptions. The
dendrogram produced by the hierarchical algorithm shows the variability of similarity



Fig. 3. Dendrograms produced by hierarchical clustering of patterns in figure 2, using: 
(a) - dissimilarity based on string editing operations; (b) the RDGC similarity . 
(c) Consistency index (%). (d) Number of clusters.

Fig. 4. Clustering of contour images represented by an 8-directional differential chain
code. (a) Dendrogram produced by the hierarchical algorithm; samples 1-42 and 43-84
are of the type represented aside. (b) Consistent clusters found after 10 resampling
experiments, with the partitional-type algorithm. (c) Histogram of the consistency
index (in percentage, see section 4.2) between the ideal partitioning and the
partitions found with the proposed algorithm, over 40 resampling experiments of the
data (random ordering). (d) Corresponding histogram of the number of clusters found.



values within the cluster of the first tool; class separation is obtained by choosing
th in a narrow interval: ]0.21; 0.3[. By applying the sentence-to-sentence algorithm
to this data, with th = 0.22, clustering results are dependent on the order of
pattern presentation, as shown by the histograms in figure 4 (c) and (d). In
these results, wheel-type contours were usually grouped in a single cluster, while
the other object class was often fragmented into several clusters of variable
composition (most of the experiments led to the partitioning of the data into 3 or
4 clusters; see plot (d)). Consistency index values lower than 1 are also the
result of overfragmentation of the data, rather than of incorrect pattern
associations. However, this overfragmentation, dependent on the order of
presentation of patterns, can be overcome by applying a technique of data
resampling followed by clustering and determination of consistent clusters, not
described here due to space limitations. The results obtained using this
technique are depicted in figure 4 (b), showing that only two patterns (numbers
50 and 55, in the initial data ordering) did not join their natural group,
forming an additional cluster.

6 Conclusions 

This paper presented a new clustering algorithm for string patterns, of the
partitional type, based on a minimum grammar complexity criterion. The ability
of the underlying similarity measure to capture structural resemblance, as
opposed to string matching, was emphasized, total independence from string
length being achieved. A theoretical analysis of the new algorithm revealed
lower computational complexity ($O(n_c n)$ and $O(n_c)$ time and space
complexities, respectively, with $n$ being the number of patterns and $n_c$ the
number of clusters found) when compared with the hierarchical version of the
algorithm presented in earlier work ($O(n^2)$ time and space complexities). As a
drawback, the partitioning produced by the new algorithm depends on the order of
pattern presentation, the relevance of this effect being problem dependent and
subject to empirical evaluation. To this end, an index of cluster agreement in
data partitions was proposed to assess the performance of clustering algorithms
on practical grounds.

The dependency of the new algorithm on the order of presentation of the
patterns can be overcome by a combined resampling/consistent clusters finding
technique, as illustrated by an application example, with no significant increase
in the computational burden. Therefore, the proposed algorithm constitutes a
feasible clustering strategy, able to handle much larger data sets than hierarchical
techniques.
Abstract. Recently, a number of authors have explored the use of
recursive neural nets (RNN) for the adaptive processing of trees
or tree-like structures. One of the most important language-theoretical
formalizations of the processing of tree-structured data is that of
finite-state tree automata (FSTA). In many cases, the number of states of
a nondeterministic FSTA (NFSTA) recognizing a tree language may be
smaller than that of the corresponding deterministic FSTA (DFSTA) (for
example, the language of binary trees in which the label of the leftmost
k-th order grandchild of the root node is the same as that on the
leftmost leaf). This paper describes a scheme that directly encodes NFSTA
in sigmoid RNN.



1 Introduction 

During the last decade, a number of authors have explored the use of analog
recursive neural nets (RNN) for the adaptive processing of data laid out as trees
or tree-like structures such as directed acyclic graphs. In this arena, Frasconi,
Gori and Sperduti [5] have recently established a rather general formulation of
the adaptive processing of structured data, which focuses on directed ordered
acyclic graphs (which include trees); Sperduti and Starita have studied the
classification of structures (directed ordered graphs, including cyclic graphs) and
Sperduti has studied the computational power of recursive neural nets as
structure processors.

One of the most important language-theoretical formalizations of the
processing of tree-structured data is that of finite-state tree automata (FSTA),
also called frontier-to-root or ascending tree automata. Deterministic FSTA
(DFSTA) may easily be realized as RNN using discrete-state units such as the
threshold linear unit (TLU). Sperduti has, in fact, recently shown that
Elman-style RNN using TLU may simulate DFSTA, and provides an intuitive
explanation (similar to that expressed by Kremer for the special case of deterministic

* Work supported by the Spanish Comisión Interministerial de Ciencia y Tecnología
through grant TIC97-0941. 

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 208- 171711 2000. 
finite automata) why this should also work for sigmoid networks: incrementing
the gain of the sigmoid function should lead to an arbitrarily precise simulation
of a step function. More recently, we have shown that there is a finite value of
this gain such that exact simulation of a DFSTA may indeed be performed
by various analog RNN architectures. This paper adds a new encoding to those
described in that work, which can be used to directly encode nondeterministic
FSTA (NFSTA); in many cases, the number of states of an NFSTA recognizing a tree
language may be much smaller than that of the corresponding DFSTA (for example,
the language of binary trees in which the label of the leftmost k-th order
grandchild of the root node is the same as that on the leftmost leaf);
therefore, there may exist very compact recursive neural network encodings of a
family of tree language recognizers. Tree automata include string automata such
as Mealy and Moore machines as a special case. For a detailed study of the
encoding of string automata in recurrent neural networks, see [2].

In the following section, tree languages and tree automata are introduced.
Section 3 describes a particular recursive neural network architecture, a
high-order Elman-like sigmoid recursive neural network. Section 4 describes the
encoding of a nondeterministic FSTA into that RNN architecture. Finally, we
present our conclusions in the last section.



2 Tree Languages and Tree Automata 



We will denote with $\Sigma$ a ranked alphabet, that is, a finite set of symbols
$\Sigma = \{\sigma_1, \ldots, \sigma_{|\Sigma|}\}$ with an associated function
$r : \Sigma \rightarrow \mathbb{N}$ giving the rank of the symbol.$^1$ The subset of
symbols in $\Sigma$ having rank $m$ is denoted with $\Sigma_m$. The set of
$\Sigma$-trees, $\Sigma^T$, is defined as the set of strings (made of symbols in
$\Sigma$ augmented with the parentheses and the comma) representing ordered
labeled trees or, recursively,

1. $\Sigma_0 \subseteq \Sigma^T$ (any symbol of rank 0 is a single-node tree in
   $\Sigma^T$).

2. $f(t_1, \ldots, t_m) \in \Sigma^T$ whenever $m > 0$, $f \in \Sigma_m$ and
   $t_1, \ldots, t_m \in \Sigma^T$ (a tree having a root node with a label $f$ of
   rank $m$ and $m$ children $t_1, \ldots, t_m$ which are valid trees of $\Sigma^T$
   belongs to $\Sigma^T$).

A nondeterministic finite-state tree automaton (NFSTA) consists of a five-tuple
$A = (Q, \Sigma, r, \Delta, F)$, where $Q = \{q_1, \ldots, q_{|Q|}\}$ is the finite
set of states, $\Sigma = \{\sigma_1, \ldots, \sigma_{|\Sigma|}\}$ is the alphabet of
labels, ranked by function $r$, $F \subseteq Q$ is the subset of accepting states
and $\Delta = \{\delta_0, \delta_1, \ldots, \delta_M\}$ is a finite collection of
transition functions of the form $\delta_m : \Sigma_m \times Q^m \rightarrow 2^Q$,
for $m \in [0, M]$ with $M$ the maximum rank or valence of the NFSTA. For all
trees $t \in \Sigma^T$, the result

$^1$ The rank may be defined more generally as a relation
$r \subseteq \Sigma \times \mathbb{N}$; both formulations are equivalent if symbols
having more than one possible rank are split.
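$\delta(t) \in 2^Q$ of the operation of NFSTA $A$ on a tree $t \in \Sigma^T$ is defined as

$$\delta(t) = \begin{cases}
\delta_0(a) & \text{if } t = a \in \Sigma_0 \\
\bigcup_{q_1 \in \delta(t_1), \ldots, q_m \in \delta(t_m)} \delta_m(f, q_1, \ldots, q_m) & \text{if } t = f(t_1, \ldots, t_m),\ 0 < m \leq M,\ f \in \Sigma_m \\
\text{undefined} & \text{otherwise}
\end{cases} \qquad (1)$$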

($M$ is the maximum number of children for any node of any tree in $L(A)$).
Deterministic FSTA are a special case of NFSTA, namely, when
$\delta_m : \Sigma_m \times Q^m \rightarrow Q$.

As usual, the language $L(A)$ recognized by a NFSTA $A$ is the subset of $\Sigma^T$
defined as

$$L(A) = \{t \in \Sigma^T : \delta(t) \cap F \neq \emptyset\}. \qquad (2)$$
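
The bottom-up operation of an NFSTA and the acceptance test can be sketched as follows. The representation is an assumption of this example: trees are nested tuples `(label, child, ...)`, the transition functions are merged into one dictionary keyed by `(label, tuple_of_child_states)`, and a missing entry is treated as the empty set rather than as undefined. The small automaton (accepting trees with at least one leaf labeled `a`) is a hypothetical illustration, not taken from the paper.

```python
from itertools import product


def run(tree, delta):
    """Set of states the NFSTA may be in after processing `tree`."""
    label, children = tree[0], tree[1:]
    child_sets = [run(child, delta) for child in children]
    reached = set()
    # Union over every choice of one state per child; for a leaf the
    # product yields a single empty tuple, which selects delta_0.
    for combo in product(*child_sets):
        reached |= delta.get((label, combo), set())
    return reached


def accepts(tree, delta, final):
    """Membership test: the reached set must intersect the accepting set."""
    return bool(run(tree, delta) & final)


# Hypothetical automaton: 'a' and 'b' have rank 0, 'f' has rank 2.
delta = {
    ("a", ()): {"any", "hit"},          # nondeterministic: two possible states
    ("b", ()): {"any"},
    ("f", ("any", "any")): {"any"},
    ("f", ("hit", "any")): {"hit"},
    ("f", ("any", "hit")): {"hit"},
    ("f", ("hit", "hit")): {"hit"},
}
final = {"hit"}

with_a = ("f", ("b",), ("f", ("a",), ("b",)))   # f(b, f(a, b))
without_a = ("f", ("b",), ("b",))               # f(b, b)
```

Keeping the whole set of reachable states at each node is what lets a deterministic procedure follow all nondeterministic choices at once.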

3 A Recursive Neural Net to Encode Tree Automata 

Here we define a recursive neural architecture that is similar to those used in
related work such as that of Frasconi, Gori and Sperduti [5], Sperduti, and
Sperduti and Starita.

A high-order Elman-like recursive neural network consists of one set of
single-layer neural networks which computes the next state (playing the role of
the collection $\Delta$ of transition functions in a finite-state tree
transducer) and one single-layer feedforward neural network with a single output
node which detects the existence of an accepting state in the set of states
active after processing a tree. The schematics of this recursive neural network
architecture are shown in Fig. 1. The next-state function is realized as a
collection of $M + 1$ high-order single-layer networks, one for each possible
rank $m = 0, \ldots, M$, having $n_X$ neurons and $m + 1$ input ports: $m$ for the
input of subtree state vectors, each of dimensionality $n_X$, and one for the
input of node labels, represented by a vector of dimensionality $n_U$.

The node label input port takes input vectors equal in dimensionality to the
number of input symbols, that is $n_U = |\Sigma|$. In particular, if $\mu$ is a node
in the tree with label $l(\mu)$ and $\mathbf{u}[\mu]$ is the input vector associated
with this node, the component $u_k[\mu]$ is equal to 1 if the input symbol at node
$\mu$ is $\sigma_k^m \in \Sigma_m$ (the $k$-th symbol of rank $m$) and 0 for all other
input symbols (one-hot or exclusive encoding).

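For a node $\mu$ with label $l(\mu) \in \Sigma_m$ and children $\nu_1, \ldots, \nu_m$,
the next state $\mathbf{x}[\mu]$ is computed by the corresponding $(m+1)$-th order
single-layer neural net as follows:

$$x_i[\mu] = g\left( w_i^m + \sum_{k=1}^{n_U} \sum_{j_1=1}^{n_X} \cdots \sum_{j_m=1}^{n_X} w^m_{i j_1 \ldots j_m k}\, x_{j_1}[\nu_1] \cdots x_{j_m}[\nu_m]\, u_k[\mu] \right) \qquad (3)$$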
Fig. 1. A high-order Elman-like recursive neural network having $m$ state ports with
$n_X$ units, one label port with $n_U$ units, and a single output unit (the acceptance unit).



where $w_i^m$ represents the bias for the network of rank $m$ and
$i = 1, \ldots, n_X$, and the $w^m_{i j_1 \ldots j_m k}$ are the weights for that
network. If $\mu$ is a leaf, i.e., $l(\mu) \in \Sigma_0$, the expression above for
the component $x_i[\mu]$ reduces to

$$x_i[\mu] = g\left( w_i^0 + \sum_{k=1}^{n_U} w^0_{ik}\, u_k[\mu] \right) \qquad (4)$$



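that is, there is a set of $|\Sigma_0|$ weights of type
$\mathbf{w}^0_k = (w^0_{1k}, \ldots, w^0_{n_X k})$ which play the role of the
initial state in recurrent networks.

Acceptance is expressed by a single unit connected to all of the state units:

$$y[\mu] = g\left( v + \sum_{i=1}^{n_X} v_i\, x_i[\mu] \right). \qquad (5)$$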



4 Encoding Tree Automata In Recursive Neural 
Networks 

4.1 Using Threshold Linear Units 

Here, we present a way to encode a NFSTA in a RNN such as the one defined above
using TLU as activation functions ($g = \theta$, with $\theta(x) = 1$ if $x > 0$
and 0 otherwise). The encoding is based on the following scheme for states. Each
of the $M$ next-state networks in the RNN will have $n_X = |Q|$ state units and
$n_U = |\Sigma_m|$ label units, and the RNN will be interpreted as being in state
$q_i \in Q$ after reading the tree rooted at $\mu$ if $x_i[\mu]$ is high, and as
not being in state $q_j \in Q$ if $x_j[\mu]$ is low. In this way, the RNN, even if
it is a deterministic device, may be interpreted as being in more than one state
of the NFSTA.

Next-state networks: Weights are chosen as follows: biases on all state units
of all of the next-state networks are equal to $-1/2$, to bring state units to
zero if they do not receive any sufficiently positive contribution. Weights
$w^m_{i j_1 \ldots j_m k}$ are 1 if
$q_i \in \delta_m(\sigma_k^m, q_{j_1}, q_{j_2}, \ldots, q_{j_m})$ and $\sigma_k^m$ is
the $k$-th symbol of $\Sigma_m$, and 0 otherwise. In this way, the unit $i$
representing state $q_i$ will be high when (a) the $k$-th symbol of $\Sigma_m$
labels the node, (b) the sets of states at the $m$ children nodes each include at
least the corresponding state $q_{j_l}$ (that is, unit $x_{j_l}$ is high), and
(c) $q_i \in \delta_m(\sigma_k^m, q_{j_1}, q_{j_2}, \ldots, q_{j_m})$.
In that case, unit $i$ receives a positive contribution (perhaps not the only one)
and becomes high. If there is not any such contribution, it remains low. This
construction, therefore, emulates the next-state functions $\delta_m$ of the NFSTA.

Acceptance unit: This unit has a bias $v = -1/2$ to keep it low in the absence
of sufficiently positive inputs. Weights $v_j$ are 1 if $q_j \in F$ and zero
otherwise. In this way, if at least one of the states at the root node of a tree
is an acceptance state, the output will be high, and zero otherwise.
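
The threshold-unit construction can be sketched directly from the weight rules above. In this illustration (an assumption-laden sketch, not the authors' code) the weight tensor is never stored: a weight is taken to be 1 exactly when the corresponding transition exists in a dictionary `delta` keyed by `(label, tuple_of_child_states)`, and the one-hot label input is realized by looking up only the entries for the node's own label. Trees are nested tuples, and the example automaton is hypothetical.

```python
from itertools import product


def step(z):
    """Threshold linear unit: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def state_vector(tree, states, delta):
    """State units x[mu] at the root of `tree` = (label, child, child, ...)."""
    label, children = tree[0], tree[1:]
    child_x = [state_vector(child, states, delta) for child in children]
    x = []
    for q_i in states:
        net = -0.5                                  # bias of every state unit
        for js in product(range(len(states)), repeat=len(children)):
            source = tuple(states[j] for j in js)
            if q_i in delta.get((label, source), ()):   # weight 1, else 0
                term = 1
                for port, j in enumerate(js):
                    term *= child_x[port][j]        # high-order product
                net += term
        x.append(step(net))
    return x


def acceptance(tree, states, delta, final):
    """Acceptance unit: bias -1/2, weight 1 from each accepting state unit."""
    x = state_vector(tree, states, delta)
    return step(-0.5 + sum(x_j for q_j, x_j in zip(states, x) if q_j in final))


# Hypothetical NFSTA accepting trees with at least one leaf labeled 'a'.
states = ["any", "hit"]
delta = {
    ("a", ()): {"any", "hit"},
    ("b", ()): {"any"},
    ("f", ("any", "any")): {"any"},
    ("f", ("hit", "any")): {"hit"},
    ("f", ("any", "hit")): {"hit"},
    ("f", ("hit", "hit")): {"hit"},
}
final = {"hit"}
```

The state vector at a node can have several units high at once, which is how the deterministic network represents the set of states of the nondeterministic automaton.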



4.2 Using Sigmoid Units 



Our main hypothesis is that, when the step function $\theta$ in the previous
construction is replaced by a scaled version of the logistic sigmoid function

$$g_L(Hx) = \frac{1}{1 + \exp(-Hx)} \qquad (6)$$

where $H$ is a positive gain, there is a finite value of $H$ that ensures a correct
NFSTA behavior of the resulting RNN (obviously, in the limit $H \rightarrow \infty$
the behavior is correct because one recovers the step function $\theta$). We will
look for the smallest value of $H$ that guarantees correct behavior.

For a binary interpretation of the output of all sigmoid units, we will use the
following criterion. Two special values, $\epsilon_0, \epsilon_1 \in [0, 1]$,
$\epsilon_0 < \epsilon_1$, will be defined so that the outputs of all units will
be taken to be low if they are in $[0, \epsilon_0]$, high if they are in
$[\epsilon_1, 1]$ and forbidden otherwise.

To ensure correct behavior in all cases, we have to ensure correct behavior 
in the worst cases, that is, when low and high values are the farthest possible 
from the ideal values 0 and 1 respectively. 



Conditions for the next-state network: For a next-state function of rank $m$ two
worst cases are possible. Let us study a typical transition
$\delta_m(\sigma_k^m, q_{j_1}, q_{j_2}, \ldots, q_{j_m})$ leading at a node $t$. We
want all $x_i[t]$ such that
$q_i \in \delta_m(\sigma_k^m, q_{j_1}, q_{j_2}, \ldots, q_{j_m})$ to be high whenever
the corresponding $x_{j_k}[t-1]$, $1 \leq k \leq m$, are high at the children nodes
and $u_k[t] = 1$, and we want all other $x_{i'}[t]$ to be low. To ensure
$x_i[t] \geq \epsilon_1$ even in the worst case, when it only receives a single
positive contribution from units that have the weakest possible high value,
$\epsilon_1$, one gets:

$$g_L(H(\epsilon_1^m - 1/2)) \geq \epsilon_1 \qquad (7)$$

Ensuring $x_i[t] \leq \epsilon_0$ even in the worst case is more complex. One has to
first consider each possible configuration of the $m$ state ports (each possible
configuration of low and high states), then consider which one would be the
worst case for that configuration, and finally look for the worst configuration.

Since we want $x_i[t]$ to be low, the worst case in each configuration $\mathbf{k}$ is
the one in which $x_i[t]$ receives the largest possible positive contribution from
state combinations $(q_{l_1}, q_{l_2}, \ldots, q_{l_m})$ not currently present in
children but corresponding to allowed transitions into state $q_i$
($q_i \in \delta_m(\sigma_k^m, q_{l_1}, q_{l_2}, \ldots, q_{l_m})$). The worst
possible contribution of each state combination that is not currently present (low
contribution in the following) is a product of $\epsilon_0$ for each state not
present (at least one) and 1 for each state present (that is, each of these
contributions ranges from $\epsilon_0$ to $\epsilon_0^m$). For each possible
combination $(q_{l_1}, q_{l_2}, \ldots, q_{l_m})$ of high and low units in each port
in a given port configuration $\mathbf{k}$, the worst case is that all the possible
low contributions correspond to allowed transitions and thus contribute noise to
the desired low value of $x_i[t]$. We will call the sum of all possible low
contributions of a configuration $\mathbf{k}$ the worst low contribution of that
configuration, $C(\mathbf{k}, n_X, m, \epsilon_0)$. The worst case for any RNN
having $m$ state ports with $n_X$ units is the maximum worst low contribution over
all possible combinations of state,
$C^*(n_X, m, \epsilon_0) = \max_{\mathbf{k}} C(\mathbf{k}, n_X, m, \epsilon_0)$. We
have not been able to obtain this value analytically but instead we have performed
an exhaustive search for this maximum over all combinations$^2$: each
configuration $\mathbf{k}$ is defined by a vector
$\mathbf{k} = (k_1, k_2, \ldots, k_m)$ specifying the number $k_i \in [1, n_X]$
($i \in [1, m]$) of low units in each port and its worst contribution may easily
be shown to be

$$C(\mathbf{k}, n_X, m, \epsilon_0) = \prod_{i=1}^{m} (n_X - k_i + k_i \epsilon_0) - \prod_{i=1}^{m} (n_X - k_i) \qquad (8)$$

(of course many contributions are equal due to symmetries). The condition for
a low value of $x_i[t]$ is, then, in the worst case,

$$g_L(H(C^*(n_X, m, \epsilon_0) - 1/2)) \leq \epsilon_0. \qquad (9)$$

Conditions for the acceptance unit: If the output of the acceptance unit has to
be high, the worst case occurs when only one state unit is contributing and its
contribution is the weakest possible. The condition is therefore

$$g_L(H(\epsilon_1 - 1/2)) \geq \epsilon_1. \qquad (10)$$

$^2$ Analytical upper bounds to $H$ are however easily obtained for the oversimplified
(impossible) worst case in which all contributions are equal to $\epsilon_0$, and there
are $n_X - 1$ of them; these bounds are inordinately high and therefore of no interest.
The worst case for a low value of the output of the acceptance unit is that all
acceptance state units have the worst low value ($\epsilon_0$) and all states but
one are acceptance states. The condition is therefore

$$g_L(H((n_X - 1)\epsilon_0 - 1/2)) \leq \epsilon_0. \qquad (11)$$

Values of $H$: The values of $H$ we are looking for are the smallest values of $H$
compatible with values of $\epsilon_0$ and $\epsilon_1$ such that
$0 < \epsilon_0 < \epsilon_1 < 1$ and all conditions (7)-(11) are fulfilled. It
happens to be the case that the most stringent condition is (9); the minimum value
of $H$ determined from that one may then be substituted in (7) to obtain the value
of $\epsilon_1$, because this value will always ensure that condition (11) is
fulfilled.

Table 1 shows values of $H$ for a set of representative values of $M$ (maximum
$m$ for a given NFSTA) and $n_X$. The values are very high; the sigmoid RNN works
almost like a discrete-state RNN because neurons are usually very saturated; the
reader is, however, reminded of the fact that these values of $H$ are (a) derived
from a worst-case-based study (which considers the maximum possible number
of transitions); (b) used to scale a particular discrete-state RNN encoding of
the NFSTA; smaller values of $H$ may therefore still guarantee correct NFSTA
behavior for many NFSTA. If, instead of using a single scaling parameter $H$,
specialized parameters were used for each unit, weight values would
get even smaller.





           M = 1    M = 2    M = 3    M = 4
n_X = 2     7.18     7.38     8.47     9.25
n_X = 3     8.36    10.22    11.93    14.25
n_X = 4     9.15    11.92    14.52    17.04
n_X = 5     9.73    13.14    16.36    19.48



Table 1. General values of the scaling factor H as a function of the number of state 
units nx and the maximum rank M for NFSTA encoding on a high-order Elman 
recursive neural network. 
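
The search for the smallest gain can be sketched numerically. The sketch below rests on several assumptions: it uses only the low-value condition for state units (the one the text identifies as most stringent), it reads the worst low contribution of a configuration as the product over ports of (n_X - k_i + k_i e0) minus the product of (n_X - k_i), it scans a uniform grid of candidate low thresholds below 1/2, and it bisects on the gain. The function names, grid size, and search bounds are hypothetical choices, and the results are only as reliable as that reading of the formulas.

```python
import math
from itertools import product


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def worst_low(n_x, m, eps0):
    """Maximum over configurations k of the assumed worst low contribution."""
    best = 0.0
    for k in product(range(1, n_x + 1), repeat=m):
        total, present = 1.0, 1.0
        for k_i in k:
            total *= n_x - k_i + k_i * eps0   # all combinations in this port
            present *= n_x - k_i              # combinations actually present
        best = max(best, total - present)
    return best


def low_condition_holds(gain, n_x, m, steps=2000):
    """True if some eps0 in (0, 1/2) satisfies the low-value condition."""
    return any(
        logistic(gain * (worst_low(n_x, m, e) - 0.5)) <= e
        for e in (0.5 * i / steps for i in range(1, steps))
    )


def smallest_gain(n_x, m, lo=1.0, hi=64.0, tol=1e-4):
    """Bisection on the gain; assumes the condition fails at lo, holds at hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if low_condition_holds(mid, n_x, m):
            hi = mid
        else:
            lo = mid
    return hi
```

Calling `smallest_gain(n_x, m)` for the ranks and state counts of Table 1 gives values that can be compared against the tabulated ones; small differences may come from rounding or from the grid resolution.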



5 Conclusion 

We have studied a strategy to encode nondeterministic finite-state tree automata 
(NFSTA) on a high-order sigmoid recursive neural network (RNN) architecture. 
This strategy has been derived from a strategy to encode a NFSTA in a discrete- 
state RNN by first turning it into a sigmoid RNN and then looking for the 
smallest value of the gain H of the sigmoid that ensures correct NFSTA behavior, 
using a worst-case criterion. It has to be noted that the values of H obtained are 
derived from general worst cases that may not occur in general, and therefore 
sigmoid RNN having a smaller gain may still show correct NFSTA behavior. 
The values of H obtained suggest that, even though in principle RNN with
finite weights can simulate exactly the behavior of NFSTA, it will in practice be 
very difficult to learn the exact finite-state behavior from examples because of 
the very small gradients present when weights reach adequately large values. 
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Abstract. Efficient and robust information retrieval from large image 
databases is an essential functionality for the reuse, manipulation, and editing of 
multimedia documents. Structural feature indexing is a potential approach to 
efficient shape retrieval from large image databases, but the indexing is 
sensitive to noise, scales of observation, and local shape deformations. It has 
now been confirmed that efficiency of classification and robustness against 
noise and local shape transformations can be improved through the feature 
indexing approach incorporating shape feature generation techniques. Based on 
this approach, an efficient, robust method is presented for retrieval of model 
shapes that have parts similar to the query shape presented to the image 
database. Effectiveness is confirmed through experimental trials with a large 
database of boundary contours, and is validated by systematically designed 
experiments with a large number of synthetic data. 



1 Introduction 

Information retrieval from large image databases is an essential functionality for the 
reuse, manipulation, and editing of multimedia documents. Images have some 
components in terms of representation, such as color, texture, and shape. Shape is an 
essential component, but shape analysis and representation are still difficult research 
subjects in spite of intensive research carried out for decades. In particular, shapes 
observed in natural scenes are often occluded, corrupted by noise, and partially 
visible. It is an important problem to retrieve efficiently model shapes that have parts 
similar to the query shape presented to the image database [1]. Shape retrieval from 
image databases has been studied recently for improving efficiency and robustness 
[2-4]. In particular, the problem is intractable when the shape is partially visible. 
Efficiency and robustness are important, but sometimes incompatible, criteria for 
performance evaluation. Improving robustness implies that the scheme for 
classification and retrieval should tolerate certain types of variations and deformations 
of images. Obviously, this may lead to inefficiency if brute-force methods are 
employed, such as a generate-and-test strategy that generates various images with a 
number of different parameters. A key to achieving both efficiency and robustness is 
a compact and well-structured representation of images that tolerates 
variations and deformations. In particular, it has been confirmed that efficiency of 
classification and robustness against noise and local shape transformations can be 
improved at the same time by the feature indexing approach incorporating shape 



F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 211-220, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 



212 H. Nishida 



feature generation techniques [5]. However, the application of this approach has been 
confined to sets of shapes represented as closed contours. 

In this paper, based on the structural feature indexing with feature generation 
models, an efficient, robust method is presented for retrieval of model shapes that 
have parts similar to the query shape presented to the image database. This paper is 
organized as follows: In Section 2, a structural representation of curves by quasi- 
convex/concave features along with quantized-directional features [5] is outlined. In 
Section 3, based on the shape representation outlined in Section 2, we describe the 
shape signature, the model database organization through feature indexing, and the 
shape retrieval through voting. In Section 4, the transformation rules of shape 
signatures are introduced to generate features that can be extracted from deformed 
patterns caused by noise and local shape deformations. The proposed algorithm is 
summarized in Section 5. In Section 6, the proposed method is validated by 
systematically designed experiments with a large number of synthetic data. Section 7 
concludes this paper. 



2 Shape Representation 



The structural representation of curves [5] is outlined in this section, based on 
quasi-convex/concave structures incorporating 2N quantized-directional features (N is a 
natural number). As shown in Fig. 1a, the curve is first approximated by a series of 
line segments. On a 2-D plane, we introduce N axes together with 2N quantized- 
direction codes. For instance, when N = 4, eight quantized directions are defined 
along with the four axes as shown in Fig. 1b. Based on these N axes together with 
2N quantized-direction codes, the analysis is carried out hierarchically. 

Fig. 1. (a) A closed contour with a polygonal approximation, (b) quantized-directional codes 
when N = 4, (c) sub-segments when N = 4, (d) segments when N = 4. 

A curve is decomposed into sub-segments at extremal points along each of the N 
axes. Fig. 1c illustrates the decomposition of the contour shown in Fig. 1a into 
sub-segments when N = 4. For adjacent sub-segments a and b, suppose that we turn 
counterclockwise when traversing them from a to b, and the joint of a and b is an 
extremal point along the axes toward the directions (j, j+1 (mod 2N), ..., k). Then, we 
write the concatenation of these two sub-segments as $a \xrightarrow{j,k} b$. For instance, the 
joint of sub-segments H and G in Fig. 1c is an extremal point along the three axes 
toward the directions 3, 4, and 5. Therefore, the concatenation of H and G is written 
as $H \xrightarrow{3,5} G$. In this way, we obtain the following concatenations for the 
sub-segments illustrated in Fig. 1c. 

L^i M, L, K, I-H- J, 

/, G, F—^ G, F^4- E, 

D—^ E, C-H- D, 5^4 C, B, M 

By linking local features around joints of adjacent sub-segments, some sequences 
of the following form can be constructed: 

$a_0 \xrightarrow{j(1,0),\,j(1,1)} a_1 \xrightarrow{j(2,0),\,j(2,1)} \cdots \xrightarrow{j(n,0),\,j(n,1)} a_n$ (1) 

A part of the contour corresponding to a sequence of this form is called a segment. 
Furthermore, the starting point of the segment is defined as the end point of $a_0$, and 
the ending point as the end point of $a_n$. When a segment is traversed from its 
starting point to its ending point, one turns counterclockwise around every joint of 
sub-segments. The following segments, as shown in Fig. 1d, are generated from the 13 
sub-segments shown in Fig. 1c: 

5'i:A^4- M, S2-.A^4- C-H- 0^4- E, 

S^-.F^^- E, S^:E—H- G, G, 

1 -H- j^4 k^4 l^4 m. 

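A segment is characterized by a pair of integers $\langle r, d \rangle$, the characteristic numbers, 
representing the angular span of the segment and the direction of the first 
sub-segment: 

$r = \sum_{i=1}^{n} \big( j(i,1) - j(i,0) \big) \bmod 2N + \sum_{i=1}^{n-1} \big( j(i+1,0) - j(i,1) \big) \bmod 2N + 2, \qquad d = j(1,0)$ 

The characteristic numbers are given by $\langle 2,7 \rangle$, $\langle 7,3 \rangle$, $\langle 2,4 \rangle$, $\langle 3,0 \rangle$, $\langle 4,3 \rangle$, and 
$\langle 6,7 \rangle$, respectively, for the six segments shown in Fig. 1d. 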
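The computation of the characteristic numbers can be sketched in a few lines. This is only an illustration under stated assumptions: the formula is read as $r = \sum_{i=1}^{n} (j(i,1) - j(i,0)) \bmod 2N + \sum_{i=1}^{n-1} (j(i+1,0) - j(i,1)) \bmod 2N + 2$ with $d = j(1,0)$, and the function name and the input layout (one pair of direction codes per joint) are hypothetical choices made for this sketch.

```python
def characteristic_numbers(joints, N):
    """Return (r, d) for one segment.

    joints: list of (j_i0, j_i1) direction-code pairs, one per joint
            i = 1..n (hypothetical input layout for this sketch).
    N:      number of axes, giving 2N quantized direction codes.
    """
    codes = 2 * N
    # Angular span covered at each joint.
    r = sum((j1 - j0) % codes for j0, j1 in joints)
    # Angular span between the end of one joint and the start of the next.
    r += sum((joints[i + 1][0] - joints[i][1]) % codes
             for i in range(len(joints) - 1))
    return r + 2, joints[0][0]


# Joint of H and G in Fig. 1c: extremal toward directions 3, 4, and 5.
print(characteristic_numbers([(3, 5)], N=4))
```

For the single joint between H and G (directions 3 to 5, N = 4), this reading gives the pair (4, 3), which is one of the six characteristic numbers listed in the text.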

Adjacent segments are connected by sharing the first sub-segments or last ones of 
the corresponding sequences. These two types of connection are denoted by $S \overset{h}{-} T$ 
and $S \overset{t}{-} T$, respectively, for two adjacent segments S and T. For instance, connections 
are denoted by $S_1 \overset{h}{-} S_2 \overset{t}{-} S_3 \overset{h}{-} S_4 \overset{t}{-} S_5 \overset{h}{-} S_6 \overset{t}{-} S_1$ for the six segments shown in Fig. 1d. 



3 Shape Signature, Indexing, and Voting 

Based on the shape representation outlined in Section 2, we describe the shape 
signature, the model database organization through feature indexing, and the shape 
retrieval through voting. For the model database organization, we assume that each 
model shape is presented to the system as line drawings or boundary contours of 
objects. In the shape retrieval, we assume that line drawings or parts of some model 
shape can be given as a query to the system. 



3.1 Shape Signatures 

In order to retrieve images for a query given as a partially visible shape, the shape 
signature is required to tolerate rotation, scaling, and translation. Therefore, features 
depending on orientation, size, and location cannot be employed as shape signatures. 
Based on the characteristic numbers and connections of segments extracted from 
model shapes or query shapes, the shape signature is constructed to satisfy this 
requirement. We assume that a series of n segments $S_i$ (i = 1, ..., n) have been 
extracted with characteristic numbers $\langle r_i, d_i \rangle$ and lengths $l_i$. The angular span $r_i$ 
does not depend on orientation, size, or location. Furthermore, the lack of information 
due to dropping orientation, size, and location can be compensated for by employing a 
triplet of the angular spans of two consecutive segments and their length ratio as the 
shape signature. From two consecutive segments $S_i$ and $S_{i+1}$ connected as 
$S_i \overset{c}{-} S_{i+1}$, $c \in \{h, t\}$, the shape signature is constructed as follows: 

$(r_i,\ r_{i+1},\ c,\ q)$ 

where q is the length ratio of $S_i$ and $S_{i+1}$, computed from $l_i$ and $l_{i+1}$ and quantized 
into Q levels, and Q is the number of quantization levels for length-ratio parameters. 

Fig. 2. Model database organization by structural indexing. Each table item stores a list 
whose element is composed of the model identifier, length, location of the center of gravity, 
and locations of the two end points of the curve segment. 



3.2 Indexing 

From each model shape, shape signatures are extracted from all pairs of consecutive 
segments. A large table, as illustrated in Fig. 2, is constructed for a model set by 
assigning a table address to a shape signature and storing there a list whose item is 
composed of the following elements: the model identifier that has the corresponding 
shape signature, and shape parameters of the curve segment corresponding to the 
shape signature, namely length, location of the center of gravity, and locations of the 
two endpoints, computed on the model shape. 



3.3 Voting 

Classification of the query shape is carried out by voting for the transformation space 
associated with each model. For each model, voting boxes are prepared for the 
quantized transformation space $(\sigma, \theta, x_T, y_T)$, where $\sigma$ is the scaling factor, $\theta$ is the 
rotation angle, and $(x_T, y_T)$ is the translation vector. Shape signatures are extracted 
from the curve segment given as a query to the shape database. For each extracted 
shape signature, model identifiers and shape parameters are retrieved from the table 
by computing the table address. By comparing the shape parameters of the extracted 
shape signature with the registered parameters, the transformation parameters 
$(\sigma, \theta, x_T, y_T)$ can be computed for each model, and the voting box corresponding to 
the transformation parameters associated with the model is incremented by one. In the 
implementation, the transformation parameters $\sigma$ and $\theta$ are computed from the line 
segment connecting the two endpoints, and $(x_T, y_T)$ is computed from the location of 
the center of gravity. 
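The indexing and voting steps can be sketched as a hash table keyed by shape signature plus a dictionary of voting boxes. This is a minimal sketch, not the author's implementation: the bin sizes `ANGLE_BINS` and `SCALE_STEP`, the function names, and the data layout are assumptions, and the translation component (computed from the center of gravity in the text) is left out to keep the example short.

```python
import math
from collections import defaultdict

ANGLE_BINS = 16      # assumed quantization of the rotation angle
SCALE_STEP = 0.25    # assumed quantization step of the scaling factor


def build_index(models):
    """models: {model_id: [(signature, (p_start, p_end)), ...]}."""
    table = defaultdict(list)
    for model_id, entries in models.items():
        for signature, endpoints in entries:
            table[signature].append((model_id, endpoints))
    return table


def scale_and_rotation(model_pts, query_pts):
    """Scale and rotation taking the model chord onto the query chord."""
    (mx0, my0), (mx1, my1) = model_pts
    (qx0, qy0), (qx1, qy1) = query_pts
    mdx, mdy = mx1 - mx0, my1 - my0
    qdx, qdy = qx1 - qx0, qy1 - qy0
    scale = math.hypot(qdx, qdy) / math.hypot(mdx, mdy)
    theta = (math.atan2(qdy, qdx) - math.atan2(mdy, mdx)) % (2 * math.pi)
    return scale, theta


def vote(table, query_entries):
    """Increment one voting box per (query signature, matching model entry)."""
    votes = defaultdict(int)
    step = 2 * math.pi / ANGLE_BINS
    for signature, q_pts in query_entries:
        for model_id, m_pts in table.get(signature, []):
            scale, theta = scale_and_rotation(m_pts, q_pts)
            box = (model_id,
                   round(scale / SCALE_STEP),
                   round(theta / step) % ANGLE_BINS)
            votes[box] += 1
    return votes


# Toy data: the query is model "A" scaled by 2 and rotated by 90 degrees.
models = {
    "A": [((2, 7, "h"), ((0, 0), (1, 0))), ((7, 2, "t"), ((1, 0), (1, 1)))],
    "B": [((2, 7, "h"), ((0, 0), (0, 2)))],
}
query = [((2, 7, "h"), ((0, 0), (0, 2))), ((7, 2, "t"), ((0, 2), (-2, 2)))]
votes = vote(build_index(models), query)
best = max(votes, key=votes.get)
print(best, votes[best])
```

Both query signatures agree on the same scale and rotation for model "A", so its box collects two votes, whereas model "B" matches only one signature with a different transformation.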



4 Feature Generation Models 

Shape signatures extracted from the curve are sensitive to noise and local shape 
deformations, and therefore, the correct model does not necessarily receive many 
votes as expected for the ideal case. Furthermore, when only one sample pattern is 
available for each class, techniques of statistical or inductive learning from training 
data cannot be employed for obtaining a priori knowledge and feature distributions of 
deformed patterns. To cope with these problems, we analyze the feature 
transformations caused by some particular types of shape deformations, constructing 
feature transformation rules. Based on the rules, we generate segment features that 
can be extracted from deformed patterns caused by noise and local shape 
deformations. In both processes of model database organization and classification, the 
features generated by the transformation rules are used for structural indexing and 
voting, as well as the features actually extracted from curves. 

The following two types of feature transformations are considered in this work: 

- Change of convex/concave structures caused by perturbations along normal 
directions on the curve and scales of observation, along with transformations of 
characteristic numbers (the angular span of the segment and the direction of the 
first sub-segment). 

- Transformations of characteristic numbers caused by small rotations. 

We describe these two types of transformation in the rest of this section. 



4.1 Transformations of Convex/Concave Structures 





Fig. 3. (a) Parts of curves similar to one another in terms of global scales, (b) editing 
structural features by merging segment blocks, (c) transformations of characteristic numbers of 
segments by small rotations. 



The convex/concave structures along the curve are changed by noise and local 
deformations, and also depend on scales of observation. For instance, the two parts of 
curves shown in Fig. 3a are similar to one another in terms of global scales, but their 
structural features are different. When N = 4, the curve shown on the left is composed of 
three segments connected as $S_1 - S_2 - S_3$ with characteristic numbers $\langle 6,6 \rangle$, $\langle 2,6 \rangle$, 
and $\langle 3,2 \rangle$, whereas the one shown on the right is composed of five segments connected 
as $S'_1 - S'_2 - S'_3 - S'_4 - S'_5$ with characteristic numbers $\langle 6,6 \rangle$, $\langle 2,6 \rangle$, $\langle 2,2 \rangle$, $\langle 2,6 \rangle$, and 
$\langle 3,2 \rangle$. To cope with such deformations, structural features on the two curves are 
edited so that their features become similar to one another. For instance, the 
structural features illustrated in Fig. 3a can be edited by merging the two segment 
blocks $\{S_1, S_2, S_3\}$ and $\{S'_1, S'_2, S'_3, S'_4, S'_5\}$ into two segments S and S' as shown in 
Fig. 3b. In the structural indexing and voting processes, for an integer M specifying the 
maximum number of segments to be merged, shape signatures are generated by 
applying Rule 1 described below to segment blocks. 

Rule 1: Let $\langle r_i, d_i \rangle$ be the characteristic number of the segment $s_i$, and consider 
the length of the curve composed of k consecutive segments $s_j, s_{j+1}, \ldots, s_{j+k-1}$. 
From a segment block 

$\{\, s_{j+k} \mid k = 0, 1, \ldots, m-1, m, \ldots, m+n-1;\ s_{j+m-1} \overset{c}{-} s_{j+m} \,\}$ 

where m and n are odd such that $1 \le m, n \le M$, a shape signature 

$\left( \sum_{k=0}^{m-1} (-1)^k r_{j+k},\ \sum_{k=0}^{n-1} (-1)^k r_{j+m+k},\ c,\ q \right)$ 

(with q the length ratio of the curves $s_j, \ldots, s_{j+m-1}$ and $s_{j+m}, \ldots, s_{j+m+n-1}$ quantized 
into Q levels) is generated if $\sum_{k=0}^{m-1} (-1)^k r_{j+k} \ge 2$, $\sum_{k=0}^{n-1} (-1)^k r_{j+m+k} \ge 2$, 
$r_{j+2k-2} - r_{j+2k-1} + r_{j+2k} \ge 2$ for $k = 1, \ldots, (m-1)/2$, and 
$r_{j+m+2k-2} - r_{j+m+2k-1} + r_{j+m+2k} \ge 2$ for $k = 1, \ldots, (n-1)/2$. 

For instance, when N = 4 and M = 3, from the six segments illustrated in Fig. 1d 
with characteristic numbers $\langle 2,7 \rangle$, $\langle 7,3 \rangle$, $\langle 2,4 \rangle$, $\langle 3,0 \rangle$, $\langle 4,3 \rangle$, and $\langle 6,7 \rangle$, the 
following shape signatures (length-ratio omitted) are generated by Rule 1: (2, 7, h), (2, 
8, h), (7, 2, t), (7, 3, t), (8, 4, t), (2, 3, h), (2, 5, h), (3, 6, h), (3, 11, h), (3, 4, t), (5, 2, t), 
(4, 6, h), (4, 11, h), (6, 2, t), (11, 2, t), (11, 3, t). In total, at most $n \lceil M/2 \rceil^2$ shape 
signatures are generated from n segments. 
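Rule 1 can be checked against the worked example with a short sketch. The assumptions are labeled here and in the code: the merged angular span is read as the alternating sum of the spans in the block, a block is accepted only if that sum and every intermediate three-term sum are at least 2, the contour is closed (indices wrap around), and the length-ratio component is omitted as in the example. The function names and the list `links` (the connection type between each segment and its successor, chosen to be consistent with the signatures listed above) are hypothetical.

```python
def merged_span(spans, start, count):
    """Alternating sum of `count` cyclically consecutive angular spans,
    or None when one of the >= 2 conditions (as read here) fails."""
    n = len(spans)
    r = [spans[(start + k) % n] for k in range(count)]
    total = sum((-1) ** k * value for k, value in enumerate(r))
    triples_ok = all(r[2 * k - 2] - r[2 * k - 1] + r[2 * k] >= 2
                     for k in range(1, (count - 1) // 2 + 1))
    return total if total >= 2 and triples_ok else None


def rule1_signatures(spans, links, M):
    """Generate (span, span, connection) triples for a closed contour.

    spans[i] is the angular span r of segment i; links[i] is the assumed
    connection type ('h' or 't') between segment i and segment i + 1.
    """
    n = len(spans)
    signatures = []
    for j in range(n):
        for m in range(1, M + 1, 2):          # odd block sizes only
            first = merged_span(spans, j, m)
            if first is None:
                continue
            for k in range(1, M + 1, 2):
                if m + k > n:                 # blocks must not overlap
                    continue
                second = merged_span(spans, j + m, k)
                if second is not None:
                    signatures.append(
                        (first, second, links[(j + m - 1) % n]))
    return signatures


spans = [2, 7, 2, 3, 4, 6]                    # r values from the example
links = ["h", "t", "h", "t", "h", "t"]        # assumed connection types
result = rule1_signatures(spans, links, M=3)
print(len(result))
print(result)
```

Under these assumptions the sketch produces 16 signatures for the six segments, matching the list given in the text.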



4.2 Transformations of Characteristic Numbers by Small Rotations 

The characteristic number $\langle r, d \rangle$ ($r \ge 2$) can be transformed by rotating the shape. 
Rules can be introduced for generating characteristic numbers by rotating the shape 
slightly (see Fig. 3c). 

Rule 2: When the curve composed of the two consecutive segments $S_1$ and $S_2$ 
with characteristic numbers $\langle r_1, d_1 \rangle$ and $\langle r_2, d_2 \rangle$ is rotated by angle 
$\psi$ ($-\pi/N \le \psi \le \pi/N$), the angular spans $r_1$ and $r_2$ can be transformed into one of 
the 9 cases: (1) $(r_1, r_2)$, (2) $(r_1, r_2 - 1)$, (3) $(r_1, r_2 + 1)$, (4) $(r_1 - 1, r_2)$, (5) 
$(r_1 - 1, r_2 - 1)$, (6) $(r_1 - 1, r_2 + 1)$, (7) $(r_1 + 1, r_2)$, (8) $(r_1 + 1, r_2 - 1)$, (9) $(r_1 + 1, r_2 + 1)$. 
Note that the cases (4)-(6) are applicable only if $r_1 \ge 3$, and that the cases (2), (5), and 
(8) are applicable only if $r_2 \ge 3$. 
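The nine cases of Rule 2 amount to perturbing each angular span by at most one step, with a decrement allowed only when the span is at least 3. A minimal sketch follows; the function names are hypothetical, and the thresholds are read as "greater than or equal to 3", which is the reading that reproduces the total of 120 signatures reported in the text for the 16 signatures generated by Rule 1.

```python
def span_options(r):
    """Angular spans reachable from r by a small rotation."""
    options = [r, r + 1]
    if r >= 3:                 # a span may shrink only if it stays >= 2
        options.append(r - 1)
    return options


def rule2_variants(r1, r2):
    """All (r1', r2') pairs allowed by Rule 2, including the original."""
    return [(a, b) for a in span_options(r1) for b in span_options(r2)]


# Angular-span pairs of the 16 signatures generated by Rule 1 in the text.
rule1_pairs = [(2, 7), (2, 8), (7, 2), (7, 3), (8, 4), (2, 3), (2, 5),
               (3, 6), (3, 11), (3, 4), (5, 2), (4, 6), (4, 11), (6, 2),
               (11, 2), (11, 3)]
total = sum(len(rule2_variants(r1, r2)) for r1, r2 in rule1_pairs)
print(total)
```

Pairs containing a span of 2 yield six variants and all other pairs yield nine, which gives 8 x 6 + 8 x 9 = 120 for the example.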

For instance, when N = 4 and M = 3, 16 shape signatures are 
generated by Rule 1 from the six segments illustrated in Fig. 1d. Then, by applying 
Rule 2 to these generated ones, 120 shape signatures, in total, are further generated. 



5 Algorithm 

In the model database organization step by structural indexing, shape signatures are 
generated from each model shape by Rules 1 and 2, and the model identifier with the 
shape parameters is appended to the list stored at the table address corresponding to 
each generated shape signature. For each model i (i = 1, 2, ..., n), let $c_i$ be the 
number of shape signatures generated by Rules 1 and 2. For instance, $c_i = 120$ for the 
contour shown in Fig. 1a when N = 4 and M = 3. In the classification and retrieval 
by voting for models and the transformation space, shape signatures are generated by 
Rules 1 and 2 from those extracted from the query shape. Model 
identifier lists are retrieved from the table by using the addresses computed from the 
generated shape signatures, and the transformation parameters are computed for each 
model on the lists. The voting box is incremented by one for the model and the 
computed transformation parameters. Let $v_i$ (i = 1, 2, ..., n) be the maximum number of votes 
among the voting boxes associated with the model i. The query shape is classified by 
selecting some models in descending order of $v_i / c_i$. Examples of 
shape retrieval are given in Fig. 4, where query shapes are presented at the top along with 
retrieved model shapes. 
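The final ranking step normalizes the maximum vote count of each model by the number of signatures that model contributed to the table, so that models with many generated signatures are not favored. A small sketch, with hypothetical names and made-up vote counts used only for illustration:

```python
def rank_models(max_votes, signature_counts):
    """Order model ids by v_i / c_i, highest ratio first.

    max_votes:        {model_id: maximum votes over its voting boxes}
    signature_counts: {model_id: number of generated shape signatures}
    """
    scores = {model: max_votes.get(model, 0) / count
              for model, count in signature_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative numbers only: "B" has more raw votes but more signatures.
ranking = rank_models({"A": 30, "B": 40}, {"A": 60, "B": 120, "C": 90})
print(ranking)
```

Here model "A" ranks first (30/60) ahead of "B" (40/120), although "B" received more raw votes.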



6 Experiments 

In this section, the proposed algorithm is evaluated quantitatively in terms of 
robustness against noise and shape deformations, based on systematically 
designed, controlled experiments with a large amount of synthetic data [5]. We 
examined the probability that the correct model is included in the top t% choices for 
various values of the deformation parameter $\beta$ [5] when curves composed of r% 
portions of a model shape are given as queries. For given values of r and $\beta$, a 
sub-contour of r% of the length is randomly extracted from the model shape, and then it is 
deformed by the deformation process described in Nishida [5]. 

The main contribution of this work is to incorporate shape feature generation 
into structural indexing for coping with shape deformations and feature 
transformations. Therefore, the performance was compared with the Stein-Medioni 
method [2], which extracts features from several versions of piecewise linear 
approximations of the curve with a variety of error tolerances for approximations. 

We carried out several experimental trials by changing the number of models from 
200 to 500, examining the classification accuracy in terms of the deformed portions of 
model shapes given as queries to image databases. In the implementation of the 
Stein-Medioni method, by changing the error tolerance of Ramer's method from 1% to 
20%, with a step of 1%, of the widest side of the bounding box of the curve, twenty 
versions of approximations were created for each model shape and the query shape. 
Furthermore, the recommended values specified in [2] were used for some 
engineering parameters. 

Table 1 presents the average classification rates for the top 1%, 2%, 3%, 5%, and 10% 
choices when $\beta \in [0.0, 0.5]$, $\beta \in [0.5, 1.0]$, and $\beta \in [1.0, 1.5]$. For instance, when a 
curve segment composed of 80% portions of a model shape subject to the deformation 
process with $\beta \in [0.5, 1.0]$ is given as a query to shape databases of 500 models, the 
correct models are included in the top 15 choices (3%) with probability 98.2% for 
the proposed algorithm and with probability 83.9% for the Stein-Medioni method. Clearly, 
significant improvements of robustness against noise and local shape deformations 
can be observed for the proposed algorithm in terms of classification accuracy 
without a significant degradation of efficiency. Through the experiments, the 
effectiveness of the shape signature and the shape feature generation models 
has been verified. 



7 Conclusion 



Structural feature indexing is a potential approach to efficient shape retrieval from 
large image databases, but the indexing is sensitive to noise, scales of observation, 
and local shape deformations. It has now been confirmed that efficiency of 
classification and robustness against noise and local shape transformations can be 
improved at the same time by the feature indexing approach incorporating shape 
feature generation techniques [5]. In this paper, based on this approach, an efficient, 
robust method has been presented for retrieval of model shapes that have parts similar 
to the query shape presented to the image database. The effectiveness has been 
confirmed by experimental trials with a large database of boundary contours and has 
been validated by systematically designed experiments with a large amount of 
synthetic data. 

Fig. 4. Examples of shape retrieval. 

Table 1. Average classification rates (%) of deformed patterns by the proposed 
algorithm in terms of the portion of model shapes (r%) presented as queries, in 
comparison with the Stein-Medioni method. 



                                   Classification rates (%) for top t% choices
β        Portion (r)  Method    t = 1   t = 2   t = 3   t = 5   t = 10
0.0-0.5  100%         Nishida   100.0   100.0   100.0   100.0   100.0
                      Stein      92.8    96.8    98.3    98.8    99.7
         80%          Nishida    99.6    99.7    99.7    99.8    99.9
                      Stein      86.3    92.3    94.8



Abstract: We present a practical implementation of a structural matching 
algorithm that uses the generalized deterministic annealing theory. The problem 
is formulated as follows: given a set of model points and object points, find a 
matching algorithm that brings the two sets of points into correspondence. An 
“energy” term represents the distance between the two sets of points. This 
energy has many local minima and the purpose is to escape from these local 
minima and to find the global minimum using the simulated annealing theory. 
We use a windowed implementation and a suitable definition of the energy 
function that reduces the computational effort of this annealing schedule 
without decreasing the solution quality. 

Keywords: generalized deterministic annealing, structural matching, unary and 
binary features, graph matching, local Markov chain. 



1 Theoretical Framework 

The problem of finding a correspondence between object and model features is an 
important part of object recognition. It can be formulated as one of graph matching, 
and there are many approaches to solving it, such as probabilistic relaxation and neural 
networks. 

Neural network methods are not restricted to looking for the right isomorphism. 
They use an “energy”, or distance measure between graphs, that is minimized by 
finding a correct correspondence between the vertex labels of the model graph and those of the 
object graph. 

Our algorithm applies the generalized deterministic annealing framework, 
formulated in [1], to the problem of structural matching between 2D model and 
object features. The problem is to find a correspondence between points of interest 
(high curvature or zero crossing points). If we have N object points and K model 
points, the solution space has cardinality $K^N$. It is a complicated combinatorial 
optimization problem. Such problems often have multiple conflicting constraints 
and typically have numerous local minima in the energy function. 

The method of simulated annealing allows the solution to escape from local 
minima. Thermal noise is added to the neural network by simulating high temperatures; 
the temperature is then slowly reduced so that the amount of thermal noise decreases. 

Generalized deterministic annealing utilizes N local Markov chains of K states. 
These are K-state neurons, representing the state probability density of the local Markov 
chains. We have N variables that have K possible states. For each variable, we 
assign a K-state neuron. We define a matrix D(i,j,T) whose rows give the 
distribution for each local Markov chain. We denote each row of the matrix by $D_n$, 
n = 0..N-1. Generalized deterministic annealing (GDA) iteratively computes the 
distribution of each local Markov chain. For each temperature, after a number of 
iterations, these distributions converge to a state of thermal equilibrium (stationary 
distribution), characterized by uniform convergence to within an $\varepsilon$-ball. As the 
temperature is lowered, the distributions become singular, all of the N local Markov 
chains are frozen, and the solution is complete. In simulated annealing (SA) theory, 
the probability of a transition from solution i to solution j is computed by: 

$P(i,j,T) = G(i,j,T)\, A(i,j,T), \quad i \ne j$ 

$P(i,i,T) = 1 - \sum_{l \in N(i)} P(i,l,T)$ (1) 

N(i) is the solution neighborhood, G(i,j) is the generation function and A(i,j,T) is 
the acceptance function. We can assume a uniform generation function 
$G(i,j) = 1/|N(i)|$, where |N(i)| is the cardinality of the solution neighborhood. In our case 
G(i,j) = 1/K. The acceptance function can be sigmoidal: 

$A(i,j,T) = \frac{1}{1 + \exp\big( (E(j) - E(i))/T \big)}$ (2) 

E(i) is the energy with respect to solution i. The distribution of each local 
Markov chain is updated using the formula 

$D_n^{t+1} = D_n^{t} P_n(T)$ (3) 

The matrix $P_n$ is constructed from the SA transition probabilities defined by (1). 
The acceptance function is defined as: 

$A_n(i,j,T) = \frac{1}{1 + \exp\big( (E_n(j) - E_n(i))/T \big)}$ (4) 

for neuron n and temperature T. 

$E_n(i)$ is the local energy function with respect to neuron n in the state i, 
i = 0..(K-1). 

Using the expressions (1), (2) and (3), it can be demonstrated [1] that the 
distributions can be updated using the formula: 

$D_n^{t+1}(j) = \frac{1}{K} \sum_{i=0}^{K-1} A_n(i,j,T) \big( D_n^{t}(i) + D_n^{t}(j) \big)$ (5) 
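The equivalence between the matrix update and the closed-form update can be checked numerically for a single neuron. The sketch below assumes the uniform generation function G = 1/K, a sigmoidal acceptance $A(i,j,T) = 1/(1 + \exp((E(j) - E(i))/T))$, and the closed form $D^{t+1}(j) = \frac{1}{K}\sum_i A(i,j,T)(D^t(i) + D^t(j))$; the function names and the sample energies are hypothetical.

```python
import math


def acceptance(e_i, e_j, T):
    """Sigmoidal acceptance of a move from state i to state j."""
    return 1.0 / (1.0 + math.exp((e_j - e_i) / T))


def update_via_matrix(dist, energies, T):
    """One step D <- D P, with P built from G = 1/K and the acceptance."""
    K = len(dist)
    P = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            if i != j:
                P[i][j] = acceptance(energies[i], energies[j], T) / K
        P[i][i] = 1.0 - sum(P[i])      # remaining probability: stay in i
    return [sum(dist[i] * P[i][j] for i in range(K)) for j in range(K)]


def update_closed_form(dist, energies, T):
    """One step using the closed-form expression for the same update."""
    K = len(dist)
    return [sum(acceptance(energies[i], energies[j], T) * (dist[i] + dist[j])
                for i in range(K)) / K
            for j in range(K)]


dist = [0.5, 0.3, 0.2]
energies = [0.0, 1.0, 2.5]             # illustrative local energies
via_matrix = update_via_matrix(dist, energies, T=1.0)
closed = update_closed_form(dist, energies, T=1.0)
print(via_matrix)
print(closed)
```

The two results agree up to rounding, and the updated distribution still sums to one; the agreement relies on the sigmoid property A(i,j,T) + A(j,i,T) = 1.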

The local energy function depends on the state of the neighboring neurons. We 
estimate the state of neuron n as: 

$j_n = \{\, j \mid D_n^{t}(j) \ge D_n^{t}(i),\ i = 0..K-1 \,\}$ 

In GDA the convergence of the neurons is demonstrated, and the final value of the 
temperature, at which the elements of $D_n$ have converged to zero or one, is 
deterministic. The final temperature is: 

$T_f < \frac{\Delta E_{min}}{\ln\left( \frac{K - K\varepsilon - 1}{K\varepsilon} \right)}$ (6) 

$\Delta E_{min}$ is the minimum change in the local energy associated with one update, and $\varepsilon$ is 
the radius of the ball within which $D_n^{t}$ converges. The starting temperature is defined by 
the condition that all solutions have equal probability and the next update remains 
close to the initial solution. Using the expression 
$A_{max} = [1 + \exp(-\Delta E_{max}/T)]^{-1}$, it can be demonstrated that the starting temperature for 
the annealing schedule must satisfy the condition: 

$T_0 > \frac{\Delta E_{max}}{\ln\left( \frac{K - 1 + K^2\varepsilon}{K - 1 - K^2\varepsilon} \right)}$ (7) 

In conclusion, for high temperatures GDA can “climb” and escape local minima, 
and for low temperatures GDA settles into the best local minimum. 



2 Implementation 

Our object model uses points of high curvature, or zero curvature crossing, as 
points of interest to be matched. The points are identified using three unary features 
and two binary features. The unary features are: 

- U1: curvature at the points of interest 
- U2: sum of the lengths of the segments adjacent to the points 
- U3: the absolute value of the variation of curvature on the segments adjacent to 
the points of interest. 

The binary features are: 

- B1: length of the line connecting two points 
- B2: length along the contour between two points of interest. 

The energy function encodes the compatibility between model and object features. 
For the curvature we have used the following formula: 

$k(u) = \frac{\dot{x}(u)\, \ddot{y}(u) - \ddot{x}(u)\, \dot{y}(u)}{\big( \dot{x}(u)^2 + \dot{y}(u)^2 \big)^{3/2}}$ (8) 

where u is the length parameter along the contour. 
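On a sampled contour the derivatives in the curvature formula have to be approximated. The sketch below uses central differences on a closed, uniformly sampled contour; this discretization and the function name are assumptions for illustration, not a scheme described in the paper. Because the formula is invariant to the parameterization, the constant sample spacing cancels and can be taken as 1.

```python
import math


def curvature(xs, ys):
    """Curvature at each sample of a closed, uniformly sampled contour,
    using central differences for the first and second derivatives."""
    n = len(xs)
    values = []
    for i in range(n):
        p, q = (i - 1) % n, (i + 1) % n
        dx, dy = (xs[q] - xs[p]) / 2.0, (ys[q] - ys[p]) / 2.0
        ddx = xs[q] - 2.0 * xs[i] + xs[p]
        ddy = ys[q] - 2.0 * ys[i] + ys[p]
        values.append((dx * ddy - ddx * dy) / (dx * dx + dy * dy) ** 1.5)
    return values


# Sanity check on a counterclockwise circle of radius 2 (curvature 1/2).
n_samples = 360
angles = [2.0 * math.pi * i / n_samples for i in range(n_samples)]
circle_x = [2.0 * math.cos(a) for a in angles]
circle_y = [2.0 * math.sin(a) for a in angles]
kappa = curvature(circle_x, circle_y)
print(min(kappa), max(kappa))
```

For the circle the estimate stays close to 0.5 at every sample; traversing the contour clockwise would flip the sign, which is what makes the sign of curvature usable for windowing model points.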
The local energy function is defined by the formula: 

$E(i,j) = \sum_{k_2=0}^{1} \sum_{k_1=0}^{KN1} WB(k_2)\, f\big( B_{k_2}^{obj}(j,k_1) - B_{k_2}^{mod}(i,k_1) \big) + \sum_{k_3=0}^{2} WU(k_3)\, f\big( U_{k_3}^{obj}(i) - U_{k_3}^{mod}(j) \big)$ (9) 

E(i,j) is the local energy of point i that has the label j. 

$k_2 = 0, 1$ indexes the binary features. 
$k_3 = 0, 1, 2$ indexes the unary features. 

$WB(k_2)$ and $WU(k_3)$ are weights representing the strength of each attribute. 

f is the compatibility function and is expressed as 
$f(u) = \tanh(c|u|)$, where c is a constant. 

$\sum_{k_2=0}^{1} WB(k_2) + \sum_{k_3=0}^{2} WU(k_3) = 1$ 

The weight coefficients sum to 1, and we take all these weights equal to 0.2. 
$B_{k_2}$ and $U_{k_3}$ are the binary and unary features of the object and the model. 
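A local energy of this kind can be sketched as a weighted sum of compatibility terms over the binary and unary feature differences. The sketch assumes the compatibility function f(u) = tanh(c|u|) and equal weights of 0.2; the function names, the argument layout (one list of neighbor values per binary feature), and the sample feature values are hypothetical.

```python
import math


def compatibility(u, c=1.0):
    """f(u) = tanh(c * |u|): 0 for identical features, approaching 1 otherwise."""
    return math.tanh(c * abs(u))


def local_energy(obj_unary, mod_unary, obj_binary, mod_binary, weight=0.2, c=1.0):
    """Weighted sum of compatibility terms for one (object point, model label).

    obj_unary, mod_unary:   three unary feature values each.
    obj_binary, mod_binary: two lists (one per binary feature), each holding
                            the values measured to the neighboring points.
    """
    energy = 0.0
    for obj_values, mod_values in zip(obj_binary, mod_binary):
        for b_obj, b_mod in zip(obj_values, mod_values):
            energy += weight * compatibility(b_obj - b_mod, c)
    for u_obj, u_mod in zip(obj_unary, mod_unary):
        energy += weight * compatibility(u_obj - u_mod, c)
    return energy


unary = [0.8, 12.0, 0.1]                       # illustrative values only
binary = [[5.0, 9.0, 13.0], [5.5, 10.0, 15.0]]
same = local_energy(unary, unary, binary, binary)
shifted = local_energy(unary, [0.2, 10.0, 0.4], binary, binary)
print(same, shifted)
```

Identical feature sets give zero energy, and any mismatch raises it by at most the weight per term because tanh saturates at 1.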

The purpose of the algorithm is to find the best solution without searching the 
entire solution set. It has the following steps: 

1) Set the temperature T = T0 and Dn(i,j,T) = 1/K. 

2) Use (4) to update Dn(i,j,T). 

3) If $|D_n^{t+1}(i,j,T) - D_n^{t}(i,j,T)| < \varepsilon$ for all i, j, then set T = 0.9T, else return to 2. 

4) If all $D_n^{t}(i,j,T)$ have converged to 0 or 1, then all the neurons have reached 
saturation and the annealing schedule stops, else return to 2. 
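The four steps above can be sketched as a short loop. To keep the example self-contained it makes a strong simplification that is not in the paper: each neuron has a fixed energy per state, whereas the local energy described in the text depends on the estimated states of neighboring neurons. The starting temperature, the function names, and the overflow guard are likewise assumptions.

```python
import math


def acceptance(e_i, e_j, T):
    """Sigmoidal acceptance of a move from state i to state j."""
    x = (e_j - e_i) / T
    if x > 700.0:                  # exp would overflow; acceptance is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def update(dist, energies, T):
    """Closed-form distribution update for one neuron."""
    K = len(dist)
    return [sum(acceptance(energies[i], energies[j], T) * (dist[i] + dist[j])
                for i in range(K)) / K
            for j in range(K)]


def anneal(energies, T0=10.0, eps=0.01, cooling=0.9, max_sweeps=10000):
    """energies[n][k]: fixed energy of state k for neuron n (simplification)."""
    N, K = len(energies), len(energies[0])
    D = [[1.0 / K] * K for _ in range(N)]                     # step 1
    T = T0
    for _ in range(max_sweeps):
        new_D = [update(D[n], energies[n], T) for n in range(N)]   # step 2
        change = max(abs(a - b)
                     for old, new in zip(D, new_D)
                     for a, b in zip(old, new))
        D = new_D
        if change < eps:                                      # step 3
            if all(min(p, 1.0 - p) < eps for row in D for p in row):
                break                                         # step 4
            T = max(T * cooling, 1e-12)
    labels = [max(range(K), key=row.__getitem__) for row in D]
    return labels, T


labels, final_T = anneal([[0.0, 1.0, 2.0], [3.0, 0.5, 1.5]])
print(labels, final_T)
```

With fixed energies each neuron simply freezes into its lowest-energy state as the temperature drops, so the sketch only illustrates the schedule (equilibrate, cool by a factor 0.9, stop at saturation), not the interaction between neurons.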

The local neighborhood of a point in the energy function is 
defined by taking into consideration only a few points around the point of interest. 
KN1 represents the number of these points, and we 
have taken KN1 = 7 for our models. This local structure reduces the execution time and 
behaves well under partial object occlusion. 

A windowed implementation can reduce wasted execution time. For 
every object point we consider a window of model points that have the same sign of 
curvature. Taking into account a standard deviation of 10% of corruptive noise in our 
experiments, we have used a window of a few points (Kw = 10). 

Another approach to time reduction is a suitable choice of $\varepsilon$. The 
expression (7) shows that when $\varepsilon \to 0$, T0 is very large and the annealing schedule is 
very long. We have tested this algorithm using object models like those in Figs. 1-4. 

Some experimental results are presented in Table 1. 



Table 1 

Object model   Nr. of iterations   ε      Windowed implementation time reduction
1              137                 0.01   45%
2              162                 0.01   56%
3              141                 0.01   61%
4              152                 0.01   52%



3 Conclusion 

The algorithm utilizes the major benefit of simulated annealing algorithms, 
namely the possibility of escaping from local minima of the energy function. A 
windowed implementation, a suitable choice of the decreasing temperature, and a suitable 
expression for the energy function can provide an important reduction in execution time in the case 
of applications with a large number of variables. 



Abstract. This paper casts the problem of point-set alignment and 
correspondence into a unified framework. The utility measure underpinning 
the work is the cross-entropy between probability distributions for 
alignment and assignment errors. We show how Procrustes alignment 
parameters and correspondence probabilities can be located using dual singular 
value decompositions. Experimental results using both synthetic and real 
images are given. 



1 Introduction 

Point pattern matching is a problem of pivotal importance in computer vision 
that continues to attract considerable interest. The problem may be abstracted as 
either alignment or correspondence. Alignment involves explicitly transforming 
the point positions under a predefined geometry so as to maximise a measure of 
correlation. Examples here include Procrustes normalisation |^, affine template 
matching and deformable point models Correspondence, on the other 
hand, involves recovering a consistent arrangement of point assignment labels. 
The correspondence problem can be solved using a variety of point assignment 
[11 ,‘fl1 ,’^j and graph-matching mm algorithms. 

The problem of point pattern matching has attracted sustained interest in 
both the vision and statistics communities for several decades. For instance,
Kendall has generalised the process to projective manifolds using the concept of
Procrustes distance. Ullman was one of the first to recognise the importance
of exploiting rigidity constraints in the correspondence matching of point-sets.
Recently, several authors have drawn inspiration from Ullman’s ideas in
developing general purpose correspondence matching algorithms using the Gaussian
weighted proximity matrix. For instance, Scott and Longuet-Higgins locate
correspondences by finding a singular value decomposition of the inter-image
proximity matrix. Shapiro and Brady, on the other hand, match by
comparing the modal eigenstructure of the intra-image proximity matrix. In fact these
two ideas provide some of the basic groundwork on which the deformable shape
models of Cootes et al. and Sclaroff and Pentland build. This work on the
co-ordinate proximity matrix is closely akin to that of Umeyama, who shows
how point-sets abstracted in a structural manner using weighted adjacency
graphs can be matched using an eigen-decomposition method. These ideas have
been extended to accommodate parameterised transformations which can
be applied to the matching of articulated objects. More recently, there have
been several attempts at modelling the structural deformation of point-sets. For
instance, Amit and Kong have used a graph-based representation (graphical
templates) to model deforming two-dimensional shapes in medical images.
Lades et al. have used a dynamic mesh to model intensity-based appearance in
images.

In a recent paper we developed a unified statistical framework for alignment 
and correspondence [3]. The motivation for the work was that the dichotomy
normally drawn between the two processes overlooks considerable scope for syn- 
ergistic interchange of information. In other words, there must always be bounds 
on alignment before correspondence analysis can be attempted, and vice versa. 
Our approach in developing the new point-pattern matching method was to em- 
bed constraints on the spatial arrangement of correspondences within an EM 
algorithm for alignment parameter recovery. This process has many features
reminiscent of Jordan and Jacobs’ hierarchical mixture of experts algorithm.
The observation underpinning this paper is that although the method proved ef- 
fective it fails to put the alignment and correspondence processes on a symmetric 
footing. The relational constraints were simply used to gate the contributions to 
the log-likelihood function for the alignment errors. 

The idea underpinning this paper is to provide a new framework for the 
maximum likelihood matching of point-sets which allows a symmetric linkage 
between alignment and correspondence. Specifically, we aim to realise interlea- 
ved iterative steps which communicate via an integrated utility measure. The 
utility measure is the cross-entropy between the probability distributions for ali- 
gnment and correspondence. By casting the cross-entropy in terms of matrices, 
we realise optimisation via dual singular value decompositions. The first of these 
transforms the point set positions so as to locate an alignment that maximises 
the weighted correlation between the point-set co-ordinates. The second singular 
value decomposition updates the set of correspondence probabilities that maxi- 
mise the weighted correlation between the edge-sets of the adjacency graphs for 
the point-sets. These processes are interleaved and iterated to convergence. 

2 Point-Sets 

Our goal is to recover the Procrustes normalisation that best aligns a set of
image feature points $w$ with their counterparts in a model $z$. In order to do this,
we represent each point in the image data set by a co-ordinate position vector
$w_i = (x_i, y_i)^T$, where $i$ is the point index. In the interests of brevity we will
denote the entire set of image points by $w = \{w_i, \forall i \in \mathcal{D}\}$, where
$\mathcal{D}$ is the point index-set. The corresponding fiducial points constituting the
model are similarly represented by $z = \{z_j, \forall j \in \mathcal{M}\}$, where
$\mathcal{M}$ denotes the index-set for the model feature-points $z_j$.

Later on we will show how the two point-sets can be aligned using singular
value decomposition. In order to establish the required matrix representation
of the alignment process, we construct two co-ordinate matrices from
the point position vectors. The data-points are represented by the matrix
$D = (w_1\ w_2\ \ldots\ w_{|\mathcal{D}|})$, whose columns are the co-ordinate position vectors. The
corresponding point-position matrix for the model is $M = (z_1\ z_2\ \ldots\ z_{|\mathcal{M}|})$.

One of our goals in this paper is to exploit structural constraints to improve 
the recovery of alignment parameters from sets of feature points. To this end we 
represent point adjacency using a neighbourhood graph. There are many alterna- 
tives including the N-nearest neighbour graph, the Delaunay graph, the Gabriel 
graph and the relative neighbourhood graph. Because of its well documented 
robustness to noise and change of viewpoint, we adopt the Delaunay triangulation
as our basic representation of image structure. We establish Delaunay
triangulations on the data and the model, by seeding Voronoi tessellations from 
the feature-points. 

The process of Delaunay triangulation generates relational graphs from the
two sets of point-features. More formally, the point-sets are the nodes of a data
graph $G_D = \{\mathcal{D}, E_D\}$ and a model graph $G_M = \{\mathcal{M}, E_M\}$. Here
$E_D \subseteq \mathcal{D} \times \mathcal{D}$ and $E_M \subseteq \mathcal{M} \times \mathcal{M}$
are the edge-sets of the data and model graphs. Later on
we will cast our optimisation process into a matrix representation. Here we use
the notation $E_D(i, i')$ to represent the elements of the adjacency matrix for the
data graph; the elements are unity if $i = i'$ or if $(i, i')$ is an edge and are zero
otherwise. We represent the state of correspondence between the two graphs using
the function $f : \mathcal{D} \to \mathcal{M}$ from the nodes of the data graph onto the nodes of
the model graph.
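
As a minimal sketch of the adjacency representation just described, the following Python builds such a matrix from an edge list. The function name `adjacency_with_unit_diagonal` and the toy edge list are illustrative assumptions; in the paper the edges come from a Delaunay triangulation, which is not computed here.

```python
def adjacency_with_unit_diagonal(num_nodes, edges):
    """Return E with E[i][k] = 1 if i == k or (i, k) is an edge, else 0."""
    E = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        E[i][i] = 1  # the text sets the diagonal elements to unity
    for i, k in edges:
        E[i][k] = 1
        E[k][i] = 1  # undirected graph, so the matrix stays symmetric
    return E


# Hypothetical 4-node graph: one triangle plus a pendant edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
E_D = adjacency_with_unit_diagonal(4, toy_edges)
```

The unit diagonal means that every node counts as its own neighbour when the matrix is later used to weight correspondences.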



3 Dual Step Matching Algorithm 

We characterise the matching problem in terms of separate probability distribu- 
tions for alignment and correspondence. In the case of alignment, the distribution 
models the registration errors between the data point positions and their counter- 
parts in the model under Procrustes alignment. The correspondence process on 
the other hand captures the consistency of the pattern of matching assignments 
to the graph representing the point-sets. The set of assignments is represented by
the function $f : \mathcal{D} \to \mathcal{M}$. Suppose that $P_{i,j}^{(n)}$ is the probability that node $i$ from
the data graph is in alignment with node $j$ from the model graph at iteration
$n$. Similarly, $Q_{i,j}^{(n)}$ is the probability that node $i$ is in correspondence with node
$j$. Further suppose that $p_{i,j}^{(n)} = p(w_i^{(n)} | z_j)$ is the probability distribution for the
alignment error between the nodes $i$ and $j$ under the Procrustes alignment at
iteration $n$. The distribution of the correspondence errors associated with the
assignment function at iteration $n$ is $q_{i,j}^{(n)}$. With these ingredients the utility
measure which we aim to maximise in the dual alignment and correspondence
step is

$$\mathcal{E} = \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}} \left\{ Q_{i,j}^{(n)} \ln p_{i,j}^{(n+1)} + P_{i,j}^{(n)} \ln q_{i,j}^{(n+1)} \right\} \qquad (1)$$



In other words, the two processes interact via a symmetric expected log- 
likelihood function. The correspondence probabilities weight contributions to 
the expected log-likelihood function for the alignment errors, and vice-versa. In 
our previous work, we showed how the first term arises through the gating of
the log-likelihood function of the EM algorithm.

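The alignment point positions and correspondence matches are recovered via
the dual maximisation equations

$$w^{(n+1)} = \arg\max_{w} \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}} Q_{i,j}^{(n)} \ln p_{i,j}^{(n+1)} \qquad (2)$$

and

$$f^{(n+1)} = \arg\max_{f} \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}} P_{i,j}^{(n)} \ln q_{i,j}^{(n+1)} \qquad (3)$$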



3.1 Alignment 

To develop a useful alignment algorithm we require a model for the measurement
process. Here we assume that the observed position vectors, i.e. $w_i$, are
derived from the model points through a Gaussian error process. Suppose that
the revised estimate of the data-point position is $w_i^{(n+1)}$. According
to our Gaussian model of the alignment errors,

$$p(w_i^{(n+1)} | z_j) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left[ -\frac{1}{2} (z_j - w_i^{(n+1)})^T \Sigma^{-1} (z_j - w_i^{(n+1)}) \right] \qquad (4)$$

where $\Sigma$ is the variance-covariance matrix for the point measurement errors.
Here we assume that the position errors are isotropic, in other words the errors
in the $x$ and $y$ directions have identical variance and are uncorrelated. As a result we write
$\Sigma = \sigma^2 I_2$, where $I_2$ is the $2 \times 2$ identity matrix. With this model, maximisation
of the expected log-likelihood function $\mathcal{E}_a$ reduces to minimising the weighted
square error measure

$$\mathcal{F}_a = \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}} Q_{i,j}^{(n)} (z_j - w_i^{(n+1)})^T (z_j - w_i^{(n+1)}) \qquad (5)$$
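
The step from the Gaussian model to the squared-error criterion can be spelled out as follows. This is a sketch under the isotropic assumption stated above, writing $Q_{i,j}^{(n)}$ for the correspondence weights and $w_i^{(n+1)}$ for the revised data-point positions; "const" denotes terms that do not involve the aligned positions.

```latex
\begin{aligned}
\ln p\bigl(w_i^{(n+1)} \mid z_j\bigr)
  &= -\ln\bigl(2\pi\sigma^2\bigr)
     - \frac{1}{2\sigma^2}\,\bigl\lVert z_j - w_i^{(n+1)} \bigr\rVert^2 , \\
\sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}}
  Q_{i,j}^{(n)} \ln p\bigl(w_i^{(n+1)} \mid z_j\bigr)
  &= \text{const}
     - \frac{1}{2\sigma^2} \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}}
       Q_{i,j}^{(n)} \bigl\lVert z_j - w_i^{(n+1)} \bigr\rVert^2 .
\end{aligned}
```

Since $1/(2\sigma^2)$ is positive, maximising the weighted log-likelihood is equivalent to minimising the weighted sum of squared distances.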

We would like to recover the maximum likelihood alignment parameters by
applying a rigid transformation to the two point-sets. We recover the required
parameter matrix by performing singular value decomposition of a
point-correspondence matrix. In order to develop the necessary formalism, we rewrite
the weighted squared error criterion using a matrix representation. Suppose that
$Q^{(n)}$ is the data-responsibility matrix whose elements are the a posteriori
correspondence probabilities $Q_{i,j}^{(n)}$. With this notation the quantity $\mathcal{F}_a$ can be
expressed in the following matrix form

$$\mathcal{F}_a = \mathrm{Tr}[\cdots] - 2\,\mathrm{Tr}[D Q^{(n)} M^T] + \mathrm{Tr}[\cdots] \qquad (6)$$

Since the first and third terms of this expression do not depend on the
alignment of the point-sets, the transformation matrix satisfies the condition

$$D^{(n+1)} = \arg\max_{D} \mathrm{Tr}[D Q^{(n)} M^T] \qquad (7)$$



The Procrustes alignment of the points can be thought of as maximising a weigh- 
ted measure of overlap or correlation between the point-sets. 

The required maximisation can be performed using a singular value decomposition.
The procedure is as follows. The matrix $D Q^{(n)} M^T$ is factorised into a
product of three new matrices $U$, $V$ and $\Delta$, where $\Delta$ is a diagonal matrix whose
elements are either zero or positive, and $U$ and $V$ are orthogonal matrices. The
factorisation is as follows: $D Q^{(n)} M^T = U \Delta V^T$.

The matrices $U$ and $V$ define a rotation matrix $\Theta$ which aligns the principal
component directions of the point-sets $M$ and $D$. The rotation matrix is equal
to $\Theta = V U^T$.



With the rotation matrix to hand we can find the Procrustes alignment
which maximises the correlation of the two point sets. The procedure is to first
bring the centroids of the two point-sets into correspondence. Next the data
points are scaled so that they have the same variance as those of the model.
Finally, the scaled and translated point-sets are rotated so that their correlation
is maximised.

To be more formal, the centroids of the two point-sets are $\mu_D = E(w_i^{(n)})$
and $\mu_M = E(z_j)$. The corresponding covariance matrices are
$\Sigma_D = E((w_i^{(n)} - \mu_D)(w_i^{(n)} - \mu_D)^T)$ and $\Sigma_M = E((z_j - \mu_M)(z_j - \mu_M)^T)$.
With these ingredients the update equation for re-aligning the data-points is

$$w_i^{(n+1)} = \mu_M + \cdots\, V U^T (w_i^{(n)} - \mu_D) \qquad (8)$$
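
The three steps of the procedure (centre, scale, rotate) can be sketched for two-dimensional points in plain Python. Several things here are assumptions rather than the paper's method: the function name `weighted_procrustes_2d` is hypothetical, the scale factor is taken to be the ratio of root-mean-square spreads, and the rotation is found from a closed-form angle that maximises the weighted correlation over proper rotations, which stands in for the SVD-based rotation and never returns a reflection.

```python
import math


def weighted_procrustes_2d(data, model, Q):
    """Align 2D data points to model points using weights Q[i][j]."""
    n_d, n_m = len(data), len(model)
    # Step 1: centroids of the two point-sets (plain means).
    mu_d = [sum(p[k] for p in data) / n_d for k in (0, 1)]
    mu_m = [sum(p[k] for p in model) / n_m for k in (0, 1)]
    dc = [(x - mu_d[0], y - mu_d[1]) for x, y in data]
    mc = [(x - mu_m[0], y - mu_m[1]) for x, y in model]
    # Weighted correlation matrix A = sum_ij Q[i][j] * w_i z_j^T (2 x 2).
    A = [[0.0, 0.0], [0.0, 0.0]]
    for i, w in enumerate(dc):
        for j, z in enumerate(mc):
            for r in (0, 1):
                for c in (0, 1):
                    A[r][c] += Q[i][j] * w[r] * z[c]
    # Step 3 (computed first): angle maximising Tr[R(theta) A].
    theta = math.atan2(A[0][1] - A[1][0], A[0][0] + A[1][1])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Step 2: ASSUMED scale, the ratio of RMS spreads (model over data).
    var_d = sum(x * x + y * y for x, y in dc) / n_d
    var_m = sum(x * x + y * y for x, y in mc) / n_m
    scale = math.sqrt(var_m / var_d)
    aligned = [
        (mu_m[0] + scale * (cos_t * x - sin_t * y),
         mu_m[1] + scale * (sin_t * x + cos_t * y))
        for x, y in dc
    ]
    return aligned, theta, scale


# Toy check: the data are the model rotated by -30 degrees, halved and shifted.
model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
phi = math.radians(30)
data = [(0.5 * (math.cos(phi) * x + math.sin(phi) * y) + 4.0,
         0.5 * (-math.sin(phi) * x + math.cos(phi) * y) - 1.0)
        for x, y in model]
Q = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
aligned, theta, scale = weighted_procrustes_2d(data, model, Q)
```

With known one-to-one weights the sketch recovers the 30 degree rotation and the factor of two in scale; with soft weights the result depends on how well the weights reflect the true correspondences.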

Finally, we update the a posteriori matching probabilities by substituting the 
revised position vector into the conditional measurement distribution. Using the 
Bayes rule, we can re-write the a posteriori measurement probabilities in terms 
of the components of the corresponding conditional measurement densities

$$P_{i,j}^{(n)} = \cdots \qquad (9)$$
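
One plausible reading of this Bayes update can be sketched as follows. It is an assumption, not the paper's formula: every model point is given the same prior weight, so the priors cancel and each posterior is a Gaussian density divided by the sum of densities over all model points. The function names and the value of `sigma` are hypothetical.

```python
import math


def gaussian_density(w, z, sigma):
    """Isotropic 2D Gaussian density of data point w about model point z."""
    d2 = (w[0] - z[0]) ** 2 + (w[1] - z[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def alignment_posteriors(data, model, sigma):
    """P[i][j]: posterior that data point i aligns with model point j.

    ASSUMPTION: equal priors for all model points, so they cancel out.
    """
    P = []
    for w in data:
        densities = [gaussian_density(w, z, sigma) for z in model]
        total = sum(densities)  # can underflow to 0 for very distant points
        P.append([d / total for d in densities])
    return P


toy_data = [(0.0, 0.0), (1.0, 0.1)]
toy_model = [(0.0, 0.0), (1.0, 0.0)]
P = alignment_posteriors(toy_data, toy_model, sigma=0.5)
```

Each row of `P` sums to one, and the largest entry in a row marks the model point closest to that data point.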



It is worth pausing to consider the relationship between the point correlation
measure developed in this paper and those exploited elsewhere in the literature
on point pattern matching. The quantity $\mathcal{F}_a$ is simply the standard measure
of overlap that is minimised in the work on least-squares alignment [18]. The
matrix $Q$, on the other hand, plays the role of the correspondence matrix used
by Scott and Longuet-Higgins. So, in the utility measure delivered by the cross
entropy the two play a synergistic role: the correspondence point proximity matrix
weights the least-squares criterion.



3.2 Correspondences 

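The correspondences are recovered via maximisation of the quantity

$$\mathcal{E}_c = \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}} P_{i,j}^{(n)} \ln q_{i,j}^{(n+1)} \qquad (10)$$

Suppose that $V_D(i) = \{i' \mid (i, i') \in E_D\}$ represents the set of nodes connected
to the node $i$ by an edge in the graph with edge-set $E_D$. Furthermore, let us
introduce a set of assignment variables that convey the following meaning

$$s_{i,j}^{(n)} = \begin{cases} 1 & \text{if } f^{(n)}(i) = j \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

In a recent study, we have shown that the probability distribution for the
assignment variables is

$$q_{i,j}^{(n+1)} = K \exp\left[ -k_e \sum_{i' \in V_D(i)} \sum_{j' \in V_M(j)} \left( 1 - s_{i',j'}^{(n+1)} \right) \right] \qquad (12)$$

where $K$ and $k_e$ are constants. With this distribution to hand, the
correspondence assignment step reduces to one of maximising the quantity

$$\mathcal{F}_c = \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{M}} \sum_{i' \in \mathcal{D}} \sum_{j' \in \mathcal{M}} E_D(i,i')\, E_M(j,j')\, P_{i,j}^{(n)}\, s_{i',j'}^{(n+1)} \qquad (13)$$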



where $E_D(i,i')$ and $E_M(j,j')$ are the elements of the adjacency matrices for
the data and model graphs. In more compact notation, the updated matrix of
correspondence indicators satisfies the condition

$$S^{(n+1)} = \arg\max_{S} \mathrm{Tr}[E_D P^{(n)} E_M S^T] \qquad (14)$$

where $P^{(n)}$ is a matrix whose elements are the alignment probabilities $P_{i,j}^{(n)}$. In
other words, the utility measure gauges the degree of correlation between the
edge-sets of the two graphs under the permutation structure induced by the
alignment and correspondence probabilities. Following Scott and Longuet-Higgins
we recover the matrix of assignment variables that maximises $\mathcal{F}_c$ by
performing the singular value decomposition $E_D P^{(n)} E_M = V \Delta U^T$, where $\Delta$ is
again a diagonal matrix and $U$ and $V$ are orthogonal matrices. The matrices $U$
and $V$ are used to compute an assignment matrix $S^{(n+1)} = V U^T$. To compute
the associated matrix of correspondence probabilities, $Q^{(n)}$, we perform row
normalisation on $S^{(n+1)}$. As a result

$$Q_{i,j}^{(n)} = \frac{S_{i,j}^{(n+1)}}{\sum_{j' \in \mathcal{M}} S_{i,j'}^{(n+1)}} \qquad (15)$$
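
The row normalisation step can be illustrated with a few lines of Python. This is only a sketch: the function name `row_normalise` and the toy matrix are hypothetical, and the toy entries are non-negative, whereas a matrix obtained from an SVD may contain negative entries that this simple scheme does not handle gracefully.

```python
def row_normalise(S):
    """Divide every row of S by its row sum so that each row sums to one."""
    Q = []
    for row in S:
        total = sum(row)
        if total == 0:
            raise ValueError("row sums to zero; cannot normalise")
        Q.append([value / total for value in row])
    return Q


# Hypothetical 2 x 3 assignment matrix (2 data nodes, 3 model nodes).
S_toy = [[3.0, 1.0, 0.0], [1.0, 1.0, 2.0]]
Q_toy = row_normalise(S_toy)
column_sums = [sum(row[j] for row in Q_toy) for j in range(3)]
```

In this toy case the rows sum to one but the columns sum to 1.0, 0.5 and 0.5, which shows the asymmetry between the two point-sets that a row-only normalisation introduces.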



This is clearly simplistic and violates symmetry. In our future work we plan
to improve the algorithm to include Sinkhorn normalisation and slack variables
for unmatchable nodes along the lines of Gold and Rangarajan.



4 Experiments 

In this section, we provide some experimental evaluation of the new unified 
approach to correspondence and alignment. Here, we use both synthetic point- 
sets and real images. 



4.1 Sensitivity Study 

To evaluate the robustness of the new approach, we furnish a sensitivity study. 
This compares the new iterative alignment method with the following three 
alternatives: 

— The first method (referred to as "Weight+SVD") is similar to that of Scott
and Longuet-Higgins. This performs the singular value decomposition
$U_S \Delta_S V_S^T$ of the initial inter-image weight matrix. Suppose that $\hat{\Delta}_S$ is the
matrix obtained by setting the diagonal elements of $\Delta_S$ to unity; then the
Scott and Longuet-Higgins algorithm delivers an updated matrix of
correspondence weights $W = U_S \hat{\Delta}_S V_S^T$. The updated weight matrix can be used
to align the point-sets using the method outlined earlier in this paper.

— The second algorithm (referred to as "Single SVD") performs the singular
value decomposition $D M^T = U \Delta V^T$ to find the rotation matrix $\Theta = V U^T$
that maximises the unweighted point correlation $\mathrm{Tr}[D M^T]$.

— The third method (referred to as "PCA") is based upon aligning and scaling
in the principal component axes of the two point-sets.

Figure 1 shows the RMS error as a function of the standard deviation of the
point position error. The main point to note from this plot is that for all four
algorithms the RMS error increases linearly with the noise standard deviation.
However, for the new algorithm (EM+Weight+SVD, shown as circles),
the rate of increase of the RMS error is much lower than for the remaining three
algorithms. In other words, the new algorithm gives more accurate alignments.

Figure shows the fraction of points in correct correspondence as a function 
of the fraction of added clutter. The main point to note for this plot is that 
the new method (EM+Weight+SVD, shown as circles) is considerably more
accurate in locating correspondences. Moreover, the two SVD-based methods
perform only marginally better than the PCA alignment.

Fig. 1. Sensitivity study. Left: Alignment error as a function of noise variance on the
point-sets. Right: Alignment error as a function of the fraction of structural error

We have evaluated the noise sensitivity of the algorithm on synthetic
point-sets and compared it with the quadratic assignment method of Gold and
Rangarajan and the Scott and Longuet-Higgins algorithm. The point sets have been
subjected to affine deformation, random measurement error (positional jitter) 
and contamination from added clutter. 

Figure 2 shows the fraction of correct correspondences as a function of the
fraction of added clutter. The method outperforms both the quadratic
assignment algorithm and the Scott and Longuet-Higgins algorithm. The onset of
significant correspondence error occurs when the fraction of clutter exceeds 30%.




Fig. 2. Positive correspondence rate as a function of the percentage of clutter for
rigidly transformed point-sets with clutter.



4.2 Real-World Data 

We have evaluated the algorithm on matching corners detected in real-world 
images. The corner detector used in our studies is described in the recent paper 
by Luo, Cross and Hancock. We use Delaunay graphs to represent the
structural arrangement of the corners. Figure H shows the correspondences bet- 
ween the corners as lines between the two images. After checking by hand, the 
fraction of correct correspondences is 77%. Figure 5 shows the iterations of the
alignment process. The process converges after 10 iterations and the alignment 
is qualitatively good. 




Fig. 5. Alignment results. Left: Overlaid original images. Middle: 1 iteration. Right:
10 iterations



5 Conclusions 



In conclusion, we have shown how the processes of point-set alignment and cor- 
respondence analysis can be unified using a symmetric entropy. By drawing on a 
Gaussian model of point position errors and an exponential model of correspon- 
dence assignment errors, we are able to cast the two problems as maximisation of 
weighted correlation measures. In both cases the point matches can be recovered 
using singular value decomposition. Our new measures of point-set similarity na- 
turally combine the ideas already developed by Scott and Longuet-Higgins and 
Umeyama in a single statistical utility measure. An experimental study reveals
that the proposed method outperforms that of Scott and Longuet-Higgins in
terms of its ability to recover from contaminating clutter and positional error in
the point-sets.

Abstract. Graph matching is an important class of methods in pattern
recognition. Typically, a graph representing an unknown pattern is
matched with a database of models. If the database of model graphs is
large, an additional factor is induced into the overall complexity of the
matching process. Various techniques for reducing the influence of this 
additional factor have been described in the literature. In this paper we 
propose to extract simple features from a graph and use them to eli- 
minate candidate graphs from the database. The most powerful set of 
features and a decision tree useful for candidate elimination are found 
by means of the C4.5 algorithm, which was originally proposed for in- 
ductive learning of classification rules. Experimental results are reported 
demonstrating that efficient candidate elimination can be achieved by 
the proposed procedure. 

Key words. Structural pattern recognition, graph matching, graph iso- 
morphism, database retrieval, database indexing, machine learning, C4.5 



1 Introduction 

In structural pattern recognition, graphs play an important role for pattern re- 
presentation. Typically, objects or parts of objects are represented by means of 
nodes, while information about relations between different objects or parts of 
objects is captured through edges. Thus structural relationships can be repre- 
sented in an explicit way. When graphs are used for pattern representation, the 
recognition problem turns into the task of graph matching. That is, a graph
extracted from an unknown input object is matched to a database of model graphs to
recognize or classify the unknown input. Application examples of graph
matching include character recognition, schematic diagram interpretation,
shape analysis and 3-D object recognition.

Structural pattern recognition by means of graph matching is attractive be- 
cause graphs are a universal representation formalism. But on the other hand, 
graph matching is expensive from the computational complexity point of view. 
Matching two graphs with each other requires time and space exponential in 
the number of nodes involved. An additional problem arises if the database of
model graphs is large because straightforward, sequential comparison of the
unknown input with each graph in the database results in an additional factor in
the complexity, which is proportional to the database’s size. 

Various indexing mechanisms have been proposed to reduce the complexity 
of graph matching in the case of large databases. In this paper we
propose a new approach based on machine learning techniques. For the purpose
of simplicity, only the problem of graph isomorphism detection is considered. 
The main idea of the proposed approach is to use simple features, which can be 
efficiently extracted from a graph, to reduce the number of possible candidates 
in the database. Examples of such features are the number of nodes or edges 
in a graph, the number of nodes or edges with a certain label, the number of 
edges (with a certain label $l$) incident to a node (with a certain label $l'$), and so on.
Obviously, a necessary condition for a graph in the database being isomorphic 
to the input graph is that these features have identical values in both graphs. 
Therefore, it can be expected that certain graphs in the database can be ruled 
out by a few fast tests using only these simple features. Consequently, the num- 
ber of candidates that have to undergo an expensive test for isomorphism can 
be reduced. 

A potential problem with this approach is that there may exist a large number 
of simple features. Hence the question arises which of these features are most
suitable to rule out as many candidates from the database as quickly as possible.
To find the best feature set, we propose the application of machine learning
techniques. In particular, the well-known C4.5 algorithm will be
used.

In the next section, a brief introduction to C4.5 will be provided. Then the 
application of C4.5 to reducing the number of candidate graphs in a database will 
be described. Experimental results will be presented in Section 4 and conclusions
drawn in Section 5.

2 Introduction to C4.5 

C4.5 is a program that generates decision trees. The reason for selecting
C4.5 for this work is that it is one of the best known programs of this kind.
C4.5 is based on the divide and conquer paradigm described in the following.
Let $S$ be the set of training instances and let the classes be $C_1, C_2, C_3, \ldots, C_n$.
There are 3 cases:

— $S$ contains one or more instances which all belong to a class $C_j$. The decision
tree for $S$ is a leaf identifying $C_j$.

— $S$ is empty. The decision tree for $S$ is a leaf, but the class associated with
the leaf must be determined from information other than $S$. For example,
the class is chosen based on background information of the domain such as
the overall majority class.

— $S$ contains instances that belong to a mixture of classes. In this case, $S$ is
refined into subsets of instances that are or seem to be heading towards a
single-class collection of cases. A test $T$ is chosen, based on a single attribute,
that has one or more mutually exclusive outcomes $O_1, O_2, O_3, \ldots, O_l$, and
$S$ is partitioned into subsets $S_1, S_2, S_3, \ldots, S_l$, where $S_j$ contains all the
instances in $S$ that have outcome $O_j$ of the chosen test. The decision tree
for $S$ now consists of a decision node identifying the test and one branch
for each possible outcome. The same process is applied to each subset of the
training cases so that the $j$-th branch leads to a decision tree constructed
from the subset $S_j$ of the training instances.
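
The three cases above map directly onto a short recursive procedure. The sketch below is illustrative only: `build_tree` and `first_varying_attribute` are hypothetical names, trees are plain dictionaries, and the attribute chooser is a toy stand-in that simply picks the first attribute taking more than one value, not C4.5's actual selection criterion.

```python
from collections import Counter


def first_varying_attribute(instances):
    """Toy stand-in for attribute selection (NOT the C4.5 criterion)."""
    names = sorted(instances[0][0])
    for name in names:
        if len({features[name] for features, _ in instances}) > 1:
            return name
    return names[0]


def build_tree(instances, choose_test, default_class=None):
    """instances: list of (features_dict, class_label) pairs."""
    if not instances:
        # Empty set: the leaf is labelled from outside information.
        return {"leaf": default_class}
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:
        # All instances belong to one class: a leaf identifying it.
        return {"leaf": labels[0]}
    # Mixture of classes: split on a single attribute and recurse.
    majority = Counter(labels).most_common(1)[0][0]
    attribute = choose_test(instances)
    partitions = {}
    for features, label in instances:
        partitions.setdefault(features[attribute], []).append((features, label))
    if len(partitions) == 1:
        # The test separates nothing; stop here to avoid endless recursion.
        return {"leaf": majority}
    return {
        "test": attribute,
        "branches": {
            outcome: build_tree(subset, choose_test, majority)
            for outcome, subset in partitions.items()
        },
    }


toy_instances = [
    ({"nodes": 3, "edges": 2}, "g1"),
    ({"nodes": 3, "edges": 3}, "g2"),
    ({"nodes": 4, "edges": 3}, "g3"),
]
toy_tree = build_tree(toy_instances, first_varying_attribute)
```

On the toy data the tree first tests `edges`, which isolates `g1`, and then tests `nodes` to separate `g2` from `g3`.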

In the particular application considered in this paper, each graph in the
database corresponds to exactly one class $C_i$, and $S = C_1 \cup C_2 \cup \ldots \cup C_n$ is
the set of prototype graphs in the database.

Partitioning at each level in the decision tree is achieved by finding the attri- 
bute that maximizes the information gained from choosing that attribute. C4.5 
uses a normalized gain criterion to select which attribute should be used. The 
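definition of the gain ratio used by C4.5 is given below:

$$\text{gain-ratio}(X) = \frac{\text{gain}(X)}{\text{split-info}(X)}$$

where

$$\text{split-info}(X) = -\sum_{i=1}^{l} \frac{|T_i|}{|T|} \times \log_2\left(\frac{|T_i|}{|T|}\right)$$

$$\text{gain}(X) = \text{info}(T) - \text{info}_X(T),$$

$$\text{info}(T) = -\sum_{j=1}^{n} \frac{\text{freq}(C_j, T)}{|T|} \times \log_2\left(\frac{\text{freq}(C_j, T)}{|T|}\right)$$

$$\text{info}_X(T) = \sum_{i=1}^{l} \frac{|T_i|}{|T|} \times \text{info}(T_i)$$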



and $T$ is the set of instances, split into subsets $T_1, \ldots, T_l$ according to the $l$
outcomes of the test $X$, with $C_j$ being an arbitrary class. The frequency ratio
freq$(C_j, T)/|T|$ represents the probability that a training instance taken at random
from the training set $T$ belongs to class $C_j$.
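
A compact way to check one's reading of these quantities is to compute them on a toy set. The following sketch uses hypothetical function names (`info`, `gain_ratio`) and handles only discrete attribute values; it does not cover the real-valued attributes, unknown values or pruning that C4.5 also supports.

```python
import math
from collections import Counter


def info(labels):
    """Entropy in bits of the class labels of a set of instances."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(instances, attribute):
    """instances: list of (features_dict, class_label) pairs."""
    labels = [label for _, label in instances]
    subsets = {}
    for features, label in instances:
        subsets.setdefault(features[attribute], []).append(label)
    total = len(instances)
    # Expected information after the split, weighted by subset size.
    info_x = sum(len(s) / total * info(s) for s in subsets.values())
    split_info = -sum(len(s) / total * math.log2(len(s) / total)
                      for s in subsets.values())
    gain = info(labels) - info_x
    # A test with a single outcome carries no split information.
    return gain / split_info if split_info > 0 else 0.0


toy = [
    ({"x": 0, "y": 7}, "a"),
    ({"x": 0, "y": 7}, "a"),
    ({"x": 1, "y": 7}, "b"),
    ({"x": 1, "y": 7}, "b"),
]
ratio_x = gain_ratio(toy, "x")  # x separates the two classes perfectly
ratio_y = gain_ratio(toy, "y")  # y is constant and separates nothing
```

Here attribute `x` yields a gain of one bit with a split information of one bit, while the constant attribute `y` is assigned a ratio of zero by the guard in the last line of `gain_ratio`.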

Attribute values can be discrete or real-valued and C4.5 can handle cases 
involving unknown or imprecise values. C4.5 always attempts to develop the 
simplest decision tree that is representative of the data and when necessary, it 
can apply sophisticated pruning techniques to reduce the size of the tree. For
further detail see the original description of C4.5.

3 Graph Candidate Elimination Using C4.5

Given an input graph $g$ and a database with $n$ model graphs $g_1, g_2, \ldots, g_n$, where
each $g_i$ represents a pattern class $C_i$, we consider the problem of finding a graph
$g_j$ in the database that is isomorphic to $g$. Formally, $g$ and $g_j$ are isomorphic
if there exists a bijective mapping $f$ from the nodes of $g$ to the nodes of $g_j$
such that the structure of the edges and all node and edge labels are preserved
under $f$. As described in the introduction, we are interested in a procedure that
efficiently reduces the number of potential candidate graphs in the database. 

Our method for graph candidate elimination has three stages. In the first 
stage we extract the features from the graphs. There are many features that can 
be potentially used, ranging from simple features such as the number of vertices 
or edges per graph to complex ones such as chains of vertices and edges with 
given labels. For our research we have selected the three types of features listed 
below: 

— number of vertices with a given label (feature-type 1).

— number of incoming edges per vertex (feature-type 2).

— number of outgoing edges per vertex (feature-type 3).

The reason for selecting these features is that they are easy and
computationally inexpensive to extract. For a graph with $m$ vertices, the extraction takes
only $O(m^2)$ time and space. On the other hand, as will be shown later, these
features are very efficient in ruling out potential candidate graphs. For example, 
if the input graph has three vertices of a given label A, then the search for iso- 
morphic graphs can be restricted only to those graphs in the database that have 
exactly three vertices with the label A. Once the features have been extracted, 
they are passed on to C4.5. 
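
The three feature types can be extracted in a single pass over the vertices and edges. The sketch below rests on assumptions: a graph is given as a list of vertex labels plus a list of directed `(source, target)` edges, the function name `extract_features` is hypothetical, and the per-vertex edge counts are summarised as histograms so that the result does not depend on how the vertices happen to be numbered.

```python
from collections import Counter


def extract_features(vertex_labels, edges):
    """vertex_labels: one label per vertex; edges: list of (src, dst) pairs."""
    m = len(vertex_labels)
    in_degree = [0] * m
    out_degree = [0] * m
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1
    return {
        "label_counts": dict(Counter(vertex_labels)),    # feature-type 1
        "in_degree_hist": dict(Counter(in_degree)),      # feature-type 2
        "out_degree_hist": dict(Counter(out_degree)),    # feature-type 3
    }


# Two numberings of the same small graph give identical features.
f1 = extract_features(["A", "A", "B"], [(0, 1), (0, 2), (1, 2)])
f2 = extract_features(["B", "A", "A"], [(1, 2), (1, 0), (2, 0)])
```

Equal features are necessary but not sufficient for isomorphism, which is why the graphs that survive this filter still have to undergo a full isomorphism test.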

The second stage involves the use of C4.5 to build the decision tree based on 
any combination of the three types of features mentioned above. 

The third stage is to determine which graphs can possibly match a given input 
graph. Obviously, a necessary condition for input graph g being isomorphic to 
model graph gj is that all features extracted from g have values identical to the 
corresponding features extracted from gj. The processing is as follows. First, 
we extract from the input graph the same features that were extracted from the 
graphs in the database. Then we use the values of the features extracted from the 
input graph to traverse the decision tree. There are only two possible outcomes. 
The first outcome is that we reach a leaf node. In this case the graphs associated 
with the leaf node are possible matches to the input graph. Each of these graphs 
is tested against the input for graph isomorphism using a conventional algorithm.
The second outcome is that we do not
reach a leaf node. In this case there are no graphs in the database that can be
isomorphic to the input graph. 
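
The third stage can be sketched as a tree walk followed by an exhaustive test on the surviving cluster. Everything concrete here is an assumption made for illustration: the dictionary layout of the tree, the feature name `num_A`, the toy database, and the brute-force permutation test, which stands in for the conventional isomorphism algorithm and is only practical for very small graphs.

```python
from itertools import permutations


def find_cluster(tree, features):
    """Follow feature values down the tree; return the leaf's graph ids or []."""
    node = tree
    while "leaf" not in node:
        value = features.get(node["test"])
        if value not in node["branches"]:
            return []  # no leaf reached: nothing in the database can match
        node = node["branches"][value]
    return node["leaf"]


def is_isomorphic(labels_a, edges_a, labels_b, edges_b):
    """Brute-force test over all vertex permutations (simple directed graphs)."""
    n = len(labels_a)
    if n != len(labels_b) or len(edges_a) != len(edges_b):
        return False
    target = set(edges_b)
    for perm in permutations(range(n)):
        labels_ok = all(labels_a[v] == labels_b[perm[v]] for v in range(n))
        if labels_ok and {(perm[s], perm[d]) for s, d in edges_a} == target:
            return True
    return False


# Hypothetical tree over one feature: the number of vertices labelled "A".
tree = {"test": "num_A",
        "branches": {2: {"leaf": ["g1", "g2"]}, 3: {"leaf": ["g3"]}}}
database = {
    "g1": (["A", "A", "B"], [(0, 1), (1, 2)]),
    "g2": (["A", "B", "A"], [(0, 1), (0, 2)]),
    "g3": (["A", "A", "A"], [(0, 1), (1, 2)]),
}
query_labels, query_edges = ["B", "A", "A"], [(1, 2), (2, 0)]
cluster = find_cluster(tree, {"num_A": query_labels.count("A")})
matches = [g for g in cluster
           if is_isomorphic(query_labels, query_edges, *database[g])]
```

Only the two graphs in the reached leaf are tested, and of those only `g1` passes; a query with a feature value that has no branch returns an empty cluster without any isomorphism test.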

The cost to extract any of the features mentioned above from a graph with
$m$ vertices is $O(m^2)$. The cost to traverse a decision tree of depth $k$ is $O(k)$,
while the cost to match a given input graph against a set of $c$ candidate graphs
is $O(c \cdot m^m)$. Hence, for a database consisting of $n$ graphs of size $m$ each, the
cost to find a match for an input graph is $O(c \cdot m^m) + O(m^2) + O(k)$, given that
the decision tree has $k$ levels and the maximum number of graphs associated
with a leaf in the decision tree is $c$. The graphs associated with a leaf node in
the decision tree will be called a cluster in the following. We observe that the
expression

$$O(c \cdot m^m) + O(m^2) + O(k) \qquad (1)$$

has to be compared to

$$O(n \cdot m^m) \qquad (2)$$

which is the computational complexity of the straightforward approach, where
we match the input graph to each element in the database. Because $c < n$ or
$c \ll n$, a significant speedup over the straightforward approach can be expected.
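
To make the comparison concrete, here is a rough worked example that treats the two expressions as exact operation counts, which they are not, since the O-notation hides constant factors. The values $n = 1000$ and $c = 2$ echo the database size and typical cluster size reported later in the experiments; $m = 10$ and $k = 10$ are illustrative assumptions.

```latex
\begin{aligned}
c \cdot m^m + m^2 + k &= 2 \cdot 10^{10} + 10^2 + 10 \approx 2 \times 10^{10}, \\
n \cdot m^m &= 10^3 \cdot 10^{10} = 10^{13}, \\
\frac{n \cdot m^m}{c \cdot m^m + m^2 + k} &\approx \frac{n}{c} = 500 .
\end{aligned}
```

Under these assumptions the feature extraction and tree traversal costs are negligible, and the speedup is essentially the ratio of database size to cluster size.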

4 Experimental Results 

The two parameters that are related to decision trees and have a significant 
impact on the overall efficiency of the method are the cluster size c and the depth 
of the decision tree k. We have investigated the two parameters by conducting 
a number of experiments on a database of randomly generated graphs. The 
parameters used in the graph generation process were the number of vertices,
the number of edges per graph and the number of graphs
in the database. 

As the influence of edge labels can be expected to be similar to the influence of
node labels, no edge labels were used. The experiments were conducted as
follows. First a database of random graphs was generated. Then the features were
extracted from the graphs and passed on to C4.5, which constructed a decision
tree to classify the graphs in the database. In the final step of the experiment,
the decision tree was analyzed and the average and largest cluster sizes were
recorded along with the depth of the decision tree.

In the first set of experiments we focused on the first type of feature extracted 
from the graphs, i.e. the frequency of the various vertex labels in the graphs. 
Therefore the decision tree classifying the graphs in the database was built using 
information from only this type of feature. To extract the feature we used a 
simple histogram that recorded the occurrence of each given vertex label. We 
started with only 5 vertex labels and then gradually increased the number of 
labels to 100. We also varied the number of vertices per graph from 5 to 50. At 
the same time, the number of edges was varied from 8 to 480. The number of 
graphs in the database was 1000 in this experiment. All graphs in the database 
had the same number of nodes and edges. Figure 1 shows the average cluster
size. The results indicate that as the number of vertices increases, the average
cluster size decreases. In the majority of the experiments, the average cluster
size converged to the value of 2, which means that in general only 2 graphs out
of 1000 need to be tested for graph isomorphism. The fact that
the size of the cluster decreases as the number of vertices or labels increases
was expected, since the greater the number of vertices/labels per graph is, the
more easily identifiable a graph becomes. Hence C4.5 is able to develop better
classification trees with a smaller cluster size.

Fig. 1. (left) Using the vertex feature (1000 graphs).

Fig. 2. (right) Using the incoming/outgoing edge features (1000 graphs).

The second set of experiments was conducted using the frequency of edges incident
per vertex. Incoming and outgoing edges were considered in separate
experiments. Again, the number of vertices was increased from 5 to 50. At the
same time, the number of edges was varied from 8 to 480. The database size
was 1000. No label information was used in these experiments. Figure 2 shows
the average cluster size for both the incoming-edge and the outgoing-edge
cases. The results are similar to those obtained when
feature-type 1 was used, in the sense that the greater the number of vertices,
the smaller the cluster size.
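The feature extraction and clustering described above can be sketched as follows. This is only an illustration under stated assumptions: the "frequency of edges incident per vertex" is interpreted here as a histogram of in-degrees or out-degrees, graphs are grouped by identical feature vectors as a simplified stand-in for the C4.5 decision tree, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def vertex_label_histogram(labels, alphabet):
    """Feature-type 1: how often each vertex label occurs in the graph."""
    counts = Counter(labels)
    return tuple(counts[a] for a in alphabet)


def degree_histogram(num_vertices, edges, incoming=True):
    """Assumed reading of feature-types 2 and 3: entry d is the number of
    vertices whose in-degree (or out-degree) equals d."""
    degree = [0] * num_vertices
    for src, dst in edges:
        degree[dst if incoming else src] += 1
    hist = Counter(degree)
    # A degree can never exceed the number of edges in the graph.
    return tuple(hist[d] for d in range(len(edges) + 1))


def cluster_by_features(graphs, alphabet):
    """Group graphs with identical feature vectors (stand-in for tree leaves)."""
    clusters = defaultdict(list)
    for name, (labels, edges) in graphs.items():
        key = (vertex_label_histogram(labels, alphabet)
               + degree_histogram(len(labels), edges, incoming=True)
               + degree_histogram(len(labels), edges, incoming=False))
        clusters[key].append(name)
    return clusters


# Toy database: g1 and g2 are isomorphic, g3 has different vertex labels.
graphs = {
    "g1": (["a", "b", "a"], [(0, 1), (1, 2)]),
    "g2": (["a", "a", "b"], [(0, 2), (2, 1)]),
    "g3": (["a", "b", "b"], [(0, 1), (1, 2)]),
}
clusters = cluster_by_features(graphs, alphabet=("a", "b"))
```

Only the graphs that share a cluster with the input graph would then need a full isomorphism test; isomorphic graphs always fall into the same cluster because the features are invariant under vertex renumbering.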

In the third set of experiments, we used pairs of features to build the decision
tree classifying the graphs in the database. Three pairs of features were used:
{feature-type 1, feature-type 2}, {feature-type 1, feature-type 3} and {feature-type
2, feature-type 3}. We used the same parameters as for feature-type 1: 1000
graphs per database, 5-50 nodes per graph with all graphs in the database having
the same size, and 5-100 vertex labels. The results obtained from combining
feature-type 1 and feature-type 2 are shown in Fig. 3. The average cluster size
has a value of 2 for all cases except when the graphs were of size 5 (the average
cluster size for this case was 3). These results are better than those obtained
when any of the three single features was used for classification. The reason for
this is that by using two features more information about the graphs becomes
available, and C4.5 was able to classify the graphs with more accuracy. The
results for the last two pairs of features, {feature-type 1, feature-type 3} and
{feature-type 2, feature-type 3}, were very similar. In both cases the average
cluster size was 2, and in both cases the results are better than in the case
where only a single feature was used to classify the graphs.

In the fourth set of experiments, we combined all three types of features. The
results obtained are shown in Fig. 4.
Note that this time the average cluster size had a constant value of 2 regardless
of the size of the graphs. Therefore the results are better than those obtained for
the cases in which single features or pairs of features were used by C4.5 to build
the decision tree. This result matches our expectation, since more information is
available and this leads to more accurate decision trees being built.

Fig. 3. (left) Using the feature combination 1-2 (1000 graphs).

Fig. 4. (right) Using the feature combination 1-2-3 (1000 graphs).

For each of the experiments described previously, we recorded not only the
cluster size but also the depth of the decision tree. When feature-type 1 was
used, the depth of the decision tree varied as shown in Fig. 5. In this case, when
the number of vertices is small (5 vertices per graph) and the number of vertex
labels is high (100 labels), the depth of the decision tree is large (over 40). As the
number of vertices increases and the number of vertex labels decreases, the depth
of the decision tree decreases until it finally converges to an average depth of
11. When feature-type 2 and feature-type 3 were used, the depth of the decision
trees was more constant and not influenced by the number of vertices in the
graphs (see Fig. 6). The average decision tree depth is 11, which is the same as
in the case of feature-type 1.



[Figure: decision tree depth plotted against the number of vertices. Left panel: one curve each for 5, 10, 30, 50, 70 and 100 labels. Right panel: one curve each for the incoming edges feature and the outgoing edges feature.]

Fig. 5. (left) Using the vertex feature (1000 graphs).

Fig. 6. (right) Using the incoming/outgoing edge features (1000 graphs).

The results obtained from using pairs of features are shown in Figs. 7, 8 and
9. In all three cases there is a substantial improvement when compared with the
results obtained using feature-type 1. The depth of the tree converges faster
to an average value of 11, and in the worst case (small number of vertices, high
number of labels) the decision tree depth is only 25. Therefore combining features
not only helps to reduce the average cluster size but also helps to reduce
the depth of the tree. The results also show that there were no substantial
differences between the three pairs of features.

Further improvement can be obtained by using more types of features. This is
indicated by the results produced when using feature-types 1-3 together (see
Fig. 10). In this case the depth of the tree converges faster to an average depth
of 11. Also, when more features are used, the depth of the tree in the worst case
scenario (small number of vertices, high number of labels) is less than 25. This
is a reduction of almost 50% compared with the case of feature-type 1.

To give some concrete hints about the computation time that can be saved
by the method proposed in this paper, we measured the relevant quantities
on a set of graphs with 20 vertices each. On a standard workstation, it
takes about 20s to match an input graph with a database of 1000 graphs using an
implementation of Ullman's algorithm. Traversing a decision tree of depth
11 takes 0.3s, and feature extraction takes 0.08s. Thus, if we neglect the time
needed off-line for decision tree construction, only about 0.4s is needed by our
method as compared to 20s for a full search. Because of the exponential complexity of
graph matching, a higher speedup can be expected for larger graphs.
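The arithmetic behind this comparison can be written out as a short sketch. The three timings are those quoted in the text; the cost of testing the remaining candidates is an assumption added here (the full-search time is divided evenly over the 1000 database graphs), and the function name is hypothetical.

```python
# Timings quoted in the text (seconds), for graphs with 20 vertices.
full_search = 20.0         # matching against all 1000 graphs
tree_traversal = 0.3       # descending a decision tree of depth 11
feature_extraction = 0.08  # computing the features of the input graph


def online_cost(candidates, database_size=1000):
    """On-line time for one query.

    Assumption (not from the text): one isomorphism test costs
    full_search / database_size seconds.
    """
    per_graph_test = full_search / database_size
    return feature_extraction + tree_traversal + candidates * per_graph_test


filtering_only = online_cost(0)   # the roughly 0.4 s quoted in the text
with_two_tests = online_cost(2)   # plus tests for an average cluster of size 2
speedup = full_search / with_two_tests
```

Under these assumptions the filtering step dominates the on-line cost, and testing the two remaining candidates changes the total only marginally.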



5 Conclusions 

In this paper we have presented a novel approach to graph matching that involves 
the use of decision trees. The method offers the advantages that it significantly 
reduces the number of candidate graphs in the database and it requires features 
that are easy to extract. It involves building a decision tree that classifies the 
graphs along selected features and, in order to find a match for a given graph one 
simply needs to go down the decision tree. In the best case scenario the leaf node 
(cluster) of the decision tree will contain just one graph while in the worst case 
it will contain several graphs. The complexity of the method is determined by 3 
parameters: the size of the cluster associated with the leaf node in the decision 
tree, the size of the graph and the depth of the decision tree. 

We have conducted numerous experiments to investigate the average cluster 
size at the leaf node and the depth of the decision tree. We used a database size 
of 1000 graphs. The graphs in the database had between 5 and 50 vertices and 
the set of vertex labels varied between 5 and 100 labels. The results indicate 
that the average cluster size is directly affected by the number of vertices and 
vertex labels in the graphs. The higher the number of vertices per graph, the 
smaller is the average cluster size. Also the higher the number of vertex labels, 
the smaller is the average cluster size. Another way of keeping the size of the 
clusters minimal is to use combinations of features. Combinations of two and 
three features generally produce increasingly better results. 

The method proposed in this paper can lead to substantial savings in computation
time when matching a graph to a database of models. Both feature
extraction and decision tree traversal are very fast. At the same time they are
suitable for eliminating a large portion of candidates from the database, leaving
only a few graphs that have to undergo an expensive isomorphism test.

The method presented in this paper is suitable to search for graph isomor- 
phisms. Future work will investigate if this method can be applied to the sub- 
graph isomorphism and the approximate graph matching problems. Future work 
could also involve the use of more complex features for classification. The fea- 
tures used for the work described in this paper are simple and easy to extract. 
However, there are other types of features, such as vertex-edge chains, that could
be used to build the decision trees. Such features are more complex and harder to
extract from graphs, but they offer the benefit of providing a more distinctive
description of each graph in the database.



Abstract. This paper presents work aimed at rendering the dual-step
EM algorithm of Cross and Hancock more efficient. The original algorithm
integrates the processes of point-set alignment and correspondence.
The consistency of the pattern of correspondence matches on
the Delaunay triangulation of the points is used to gate contributions
to the expected log-likelihood function for point-set alignment parameters.
However, in its original form the algorithm uses a dictionary of
structure-preserving mappings to assess the consistency of match. This
proves to be a serious computational bottleneck. In this paper, we show
how graph edit-distance can be used to compute the correspondence
probabilities more efficiently. In a sensitivity analysis, we show that the edit
distance method is not only more efficient, it is also more accurate than
the dictionary-based method.



1 Introduction 

The matching of point-sets is a problem of central importance in computer vision.
The process is usually abstracted as either alignment or correspondence.
Alignment is concerned with recovering the set of transformation parameters
that bring the points into registration with one another. Correspondence
is a symbolic process which is concerned with consistently labelling the points
[4,6]. Alignment can be realised using maximum likelihood methods while
correspondence is frequently posed as a graph-matching problem.

In the majority of the literature there is a strong dichotomy between the two
approaches. However, in a recent paper Cross and Hancock have observed
that there are important synergies that can be exploited. Specifically, they have
noted that there is a chicken-and-egg problem. Before alignment parameters can
be recovered, correspondences need to be available. Correspondence estimation,
on the other hand, needs information concerning alignment. In order to
overcome this problem, they develop a dual-step EM algorithm in which the
consistency of the pattern of correspondences is used to constrain the estimation
of alignment parameters. The method has proved effective in the matching
of planar point-sets under affine and perspective geometries.

* Corresponding author. Email erh@cs.york.ac.uk

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 246 ff., 2000.
© Springer-Verlag Berlin Heidelberg 2000

Efficient Alignment and Correspondence Using Edit Distance

Despite proving effective, the method is computationally demanding. The reason
for this is that the correspondence probabilities, which weight contributions
to the expected log-likelihood function for alignment parameter estimation, are
computed using a time-consuming dictionary-based method. The aim in this paper
is to address this deficiency by using a more efficient method for computing
the correspondence probabilities. We turn to the edit-distance model recently
reported by Myers, Wilson and Hancock. The main bottleneck in the Cross
and Hancock method is the need to model the effects of structural error by
padding the consistent mappings between graphs with dummy nodes. In the edit
distance approach, this is simplified by computing the Levenshtein distance between
coded strings that represent the neighbourhood structure of the graphs being
matched. By adopting the edit-distance method for computing the correspondence
probabilities, we not only accelerate the Cross and Hancock method, we
also increase its accuracy. This is attributable to the fact that the
dictionary-based method is more likely to become trapped in local minima.

2 Preliminaries 

One of our goals in this paper is to recover the parameters that describe
a coordinate system transformation that will best bring a set of model-image feature
points z into registration with their counterparts in a data set w. In order to
do this, we represent each point in the model set by an augmented position vector
$\vec{z}_i = (x_i, y_i, 1)^T$, where i is the point index. This augmented vector represents
the two-dimensional point position in a homogeneous coordinate system. We will
assume that all these points lie on a single plane in the image. In the interests
of brevity we will denote the entire set of model points by $z = \{\vec{z}_i, \forall i \in \mathcal{M}\}$,
where $\mathcal{M}$ is the point index-set for the model feature-points $\vec{z}_i$. The corresponding
fiducial points constituting the data-image are similarly represented
by $w = \{\vec{w}_j, \forall j \in \mathcal{D}\}$, where $\mathcal{D}$ denotes the point index-set.

In this paper we are interested in affine transformations, which have six free
parameters. These model the two components of translation of the origin on the
image plane, the overall rotation of the co-ordinate system, the overall scale,
together with the two parameters of shear. These parameters can be combined
succinctly into an augmented matrix that takes the form

$$\Phi^{(n)} = \begin{pmatrix} \phi^{(n)}_{1,1} & \phi^{(n)}_{1,2} & \phi^{(n)}_{1,3} \\ \phi^{(n)}_{2,1} & \phi^{(n)}_{2,2} & \phi^{(n)}_{2,3} \\ 0 & 0 & 1 \end{pmatrix} \qquad (1)$$

With this representation, the affine transformation of co-ordinates is computed
using the following matrix multiplication

$$\vec{z}_i^{(n)} = \Phi^{(n)} \vec{z}_i \qquad (2)$$

Clearly, this multiplication gives us a vector of the form $\vec{z}_i^{(n)} = (x, y, 1)^T$.
The superscript n indicates that the parameters are taken from the nth iteration
of our algorithm.
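As a small illustration of this matrix multiplication, the following sketch applies a 3 x 3 augmented affine matrix to a point in homogeneous coordinates. The function name and the numerical values are hypothetical and chosen only for the example.

```python
def apply_affine(phi, point):
    """Multiply the augmented 3x3 matrix phi by the point (x, y, 1)^T."""
    x, y = point
    z = (x, y, 1.0)
    return tuple(sum(phi[r][c] * z[c] for c in range(3)) for r in range(3))


# Illustrative matrix: uniform scale by 2 followed by a translation of (1, -1).
phi = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, -1.0],
    [0.0, 0.0, 1.0],
]
transformed = apply_affine(phi, (3.0, 4.0))
```

Because the last row of the matrix is (0, 0, 1), the third component of the result is always 1, so the output is again a point in homogeneous form.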

The basic idea behind the dual-step EM algorithm of Cross and Hancock
is to exploit structural constraints to improve the recovery of affine parameters
from sets of feature points. Because of its well-documented robustness to noise
and change of viewpoint, we adopt the Delaunay triangulation as our basic
representation of image structure. We establish Delaunay triangulations on
the data and the model by seeding Voronoi tessellations from the feature-points.

The process of Delaunay triangulation generates relational graphs from the
two sets of point-features. More formally, the point-sets are the nodes of a data
graph $G_D = \{\mathcal{D}, E_D\}$ and a model graph $G_M = \{\mathcal{M}, E_M\}$. Here $E_D \subseteq \mathcal{D} \times \mathcal{D}$
and $E_M \subseteq \mathcal{M} \times \mathcal{M}$ are the edge-sets of the data and model graphs. Key to our
matching process is the idea of using the edge-structure of Delaunay graphs to
constrain the correspondence matches between the two point-sets. We represent
the set of correspondences at iteration n by the function $f^{(n)} : \mathcal{M} \rightarrow \mathcal{D}$. In
other words, the statement $f^{(n)}(i) = j$ means that the model-point i is in
correspondence with data-point j at iteration n of the matching process. In order
to construct the expected log-likelihood function we will need to compute the
consistency of the arrangement of correspondence matches. We therefore let
$\zeta^{(n)}_{i,j}$ denote the probability of the correspondence match $f^{(n)}(i) = j$. Cross
and Hancock observed that the EM algorithm provides a natural framework
for recovering the required correspondences and aligned point co-ordinates. The
method is concerned with finding maximum likelihood solutions to problems posed
in terms of missing or hidden data. According to Cross and Hancock, if the
pattern of correspondences is regarded as missing data, then the task of
maximising the complete likelihood function, conditioned on the point-sets w and z,
can be posed as that of maximising the expected log-likelihood

$$Q(\Phi^{(n+1)} \mid \Phi^{(n)}) = \sum_{i \in \mathcal{M}} \sum_{j \in \mathcal{D}} P(\vec{z}_i \mid \vec{w}_j, \Phi^{(n)}) \, \zeta^{(n)}_{i,j} \, \ln p(\vec{w}_j \mid \vec{z}_i, \Phi^{(n+1)}) \qquad (3)$$

Broadly speaking, we can describe the EM algorithm framework as follows. In
the Expectation step the a posteriori probabilities $P(\vec{z}_i \mid \vec{w}_j, \Phi^{(n)})$ of the missing
data (i.e. the model-graph measurement vectors, $\vec{z}_i$) are updated by substituting
the point position vectors into the conditional measurement distribution. In the
Maximization step, two interleaved substeps are performed: the correspondence assignments
$f^{(n+1)}(i) = \arg\max_{j \in \mathcal{D}} P(\vec{z}_i \mid \vec{w}_j, \Phi^{(n)}) \, \zeta^{(n)}_{i,j}$ are calculated, and the updated
alignment matrix $\Phi^{(n+1)}$ is estimated. The update has the form

$$\Phi^{(n+1)} = \left[\, \cdots \,\right]^{-1} \times \left[\, \cdots \,\right]$$

where the elements of the matrix U are the partial
derivatives of the affine transformation matrix with respect to the individual
parameters and $\Sigma$ is the variance-covariance matrix for the position errors.

The probabilities $\zeta^{(n)}_{i,j}$ measure the consistency of the pattern of correspondences
when the match $f^{(n)}(i) = j$ is made, and their computation constitutes the
main computational bottleneck. Compared with this, the estimation of transformation
parameters represents a relatively small overhead. Our aim in this paper
is to find an alternative way to compute $\zeta^{(n)}_{i,j}$ in order to reduce the execution
time of the method.
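The correspondence substep of the Maximization step can be sketched as follows. This is a minimal illustration, assuming that each model point is assigned to the data point that maximises the product of the a posteriori alignment probability and the structural gating probability; both probability tables are taken as given inputs, and the function name is hypothetical.

```python
def assign_correspondences(posterior, gating):
    """Return f as a dict: model index i -> data index j.

    posterior[i][j] stands for the a posteriori alignment probability of
    model point i and data point j; gating[i][j] stands for the structural
    matching probability. Both tables are assumed inputs.
    """
    f = {}
    for i, row in enumerate(posterior):
        scores = [p * q for p, q in zip(row, gating[i])]
        f[i] = max(range(len(scores)), key=scores.__getitem__)
    return f


# Two model points, two data points (illustrative numbers only).
posterior = [[0.6, 0.4],
             [0.5, 0.5]]
gating = [[0.2, 0.8],
          [0.9, 0.1]]
f_next = assign_correspondences(posterior, gating)
```

In the example the structural term overrides the positional preference of the first model point, which is the gating effect the text describes.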



3 The Structural Matching Probabilities 

In the original dual-step EM algorithm, the gating probabilities $\zeta^{(n)}_{i,j}$ are computed
using a dictionary of structure-preserving mappings between the nodes of the
model-graph and the nodes of the data-graph. These structures are subgraphs
that consist of neighbourhoods of nodes interconnected by arcs; for convenience
we refer to these structural subunits as supercliques. The superclique of the node
i being matched in the model graph with arc-set $E_M$ is denoted by the set of
nodes $C_i = i \cup \{l \mid (i, l) \in E_M\}$. The matched realisation of this superclique under
the mapping function is $\Gamma_i = \{f^{(n)}(l), \forall l \in C_i\}$. Supercliques are illustrated
in panel (a) of Figure 1, which shows a graph with two of its supercliques
highlighted.
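The two definitions above can be made concrete with a short sketch. It assumes an undirected edge list and lists the neighbours in the order in which they are encountered, whereas the method orders them cyclically around the centre node; the function names are hypothetical.

```python
def superclique(node, edges):
    """Centre node followed by every node joined to it by an edge."""
    clique = [node]
    for a, b in edges:
        if a == node and b not in clique:
            clique.append(b)
        elif b == node and a not in clique:
            clique.append(a)
    return clique


def matched_realisation(clique, f):
    """Image of a superclique under the current correspondence function f."""
    return [f[l] for l in clique]


# Small model graph and a hypothetical correspondence function.
model_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
f = {0: "a", 1: "b", 2: "c", 3: "d"}

c2 = superclique(2, model_edges)
gamma2 = matched_realisation(c2, f)
```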




Fig. 1. Supercliques as defined by Wilson and Hancock.



The critical ingredient in developing the matching scheme is the set of feasible
mappings between each superclique of the model-graph and those of the data
graph. The set of feasible mappings, or dictionary, for the superclique $C_i$ is
denoted by $\Theta_{i,j} = \{S_j\}$, where $S_j = j \cup \{l \mid (j, l) \in E_D\}$. Each element $S_j$ of $\Theta_{i,j}$ is
therefore a relation formed on the nodes of the data-graph; hence, the dictionary
of feasible mappings for the superclique $C_i$ consists of all the consistent relations
that may be elicited from the data-graph. In practice the dictionary is compiled
by considering the cyclic permutations of the non-centre nodes in the superclique
$C_j$ about the centre, as shown in panel (b) of Figure 1. A complication arises from
the fact that not all supercliques have the same size. Wilson and Hancock
addressed this problem by padding the dictionary items with dummy labels so
that each had the same size as the local configuration. This is essentially a
brute-force method, and may significantly add to the complexity of the dictionaries, as
we show later.



4 Computing Correspondence Probabilities Using Edit 
Distance 



Recently Myers, Wilson and Hancock have overcome this problem of dictionary
padding by showing how $\zeta^{(n)}_{i,j}$ can be computed more efficiently using
edit-distance. The Levenshtein or string-edit distance is a measure of the distance
between lists of differing lengths. This avoids the use of padding altogether,
by considering insertions and deletions in addition to changes. In what
follows, we work with a simplified dictionary $\Theta^{\circ}_{i,j}$ which contains only cyclic
permutations and whose size is therefore equal to $|C_j| - 1$.
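The two ingredients named here, the string-edit distance and the dictionary of cyclic permutations, can be sketched as follows. The edit distance uses the standard dynamic-programming recurrence with unit costs, which is an assumption about the cost model; the function names and the example strings are hypothetical.

```python
def levenshtein(a, b):
    """String-edit distance with unit insertion, deletion and change costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # change x into y
        prev = curr
    return prev[-1]


def cyclic_dictionary(clique):
    """Keep the centre node first and rotate the remaining nodes around it."""
    centre, ring = clique[0], clique[1:]
    return [[centre] + ring[k:] + ring[:k] for k in range(len(ring))]


# A matched realisation that is one node shorter than the data superclique.
gamma = ["c", "b", "d"]
dictionary = cyclic_dictionary(["c", "a", "b", "d"])
best = min(levenshtein(gamma, s) for s in dictionary)
```

No padding is needed: the size difference between the two supercliques is absorbed by a single insertion, and the dictionary holds one entry per non-centre node.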

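Suppose that $P^*_{\Gamma_i,S}$ is the optimal edit path between the relational image $\Gamma_i$
and the unpadded dictionary item S. In their recent paper, they have shown
that

$$\zeta^{(n)}_{i,j} = \frac{\sum_{S \in \Theta^{\circ}_{i,j}} \exp\left[ -\left( k_W W(P^*_{\Gamma_i,S}) + k_L L(P^*_{\Gamma_i,S}) \right) \right]}{\sum_{j \in \mathcal{D}} \sum_{S \in \Theta^{\circ}_{i,j}} \exp\left[ -\left( k_W W(P^*_{\Gamma_i,S}) + k_L L(P^*_{\Gamma_i,S}) \right) \right]} \qquad (4)$$

where the constants $k_W$ and $k_L$ are logarithmic quantities defined in terms of
an error-probability $P_e$.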



5 Complexity 

As shown by Cross and Hancock, the computation of $\zeta^{(n)}_{i,j}$ is based on
the number of dictionary comparisons. For a single step the time
complexity is $O(|\Theta_{i,j}| \cdot |S|)$. Moreover, the length of the structure-preserving
mappings, $|S|$, is linear in the superclique size and will not play a significant role
in the overall complexity. Myers, Wilson and Hancock have formalised
an upper bound, proportional to the number of data-graph nodes $|V_D|$, for the total
size of the dictionary in terms of the average data graph superclique size $|C_j|$ when padding
is required. To compute the edit distance between the two strings $\Gamma_i$ and S we
have adopted the standard algorithm described by Wagner and Fischer [2].
The complexity for a single comparison is $O(|\Gamma_i| \cdot |S|)$. Since $|\Gamma_i| = |C_i|$, $|S| = |C_j|$
and it is sufficient to consider only $|C_j| - 1$ cyclic permutations of $C_j$, the total
size of the dictionary, summed over the data nodes $j \in \mathcal{D}$, is

$$|V_D| \cdot O\left( |C_j| \, |C_j|^2 \right) \qquad (5)$$

which is polynomial in the size of the supercliques. Moreover, for Delaunay
graphs, the average node degree (and hence the average superclique size) is 6;
using this upper bound we can write

$$\left( |\Theta_{i,j}| \cdot \mathrm{Time} \right)_{[\mathrm{Padding}]} \gg \left( |\Theta^{\circ}_{i,j}| \cdot \mathrm{Time} \right)_{[\mathrm{Edit}]} \qquad (6)$$

Hence, although the edit distance calculation is less efficient than a linear
comparison with a padded dictionary, the number of dictionary elements to
compare will be much smaller than with Wilson and Hancock's padded dictionary
approach, giving a considerable reduction in execution time.



6 Experiments 

In this section, we provide experimental evaluation of our coupled matching pro- 
cess. This investigation has two distinct strands. We present both an algorithm 
sensitivity analysis and an application on real world imagery. 



6.1 Sensitivity Analysis 

Here we experiment with the edit-distance based dual-step matching scheme 
and compare it with its dictionary-based counterpart. Our experiments explore 
the sensitivity of the method to structural corruption and affine rotations. The 
experiments are based on synthetic graphs generated by randomly distributing 
20 nodes in a 200 x 200 pixel window. 

Our sensitivity analysis compares the accuracy of the original padded
dictionary-based approach with the new edit distance-based method. For both algorithms
the initial affine alignment matrix was the identity matrix and the initial
correspondences among feature points were all incorrect. To limit the amount of
computation required by the original algorithm, the maximum allowed amount
of dictionary padding was 2 nodes per superclique. In each of the following plots
the red curve indicates the performance of the new edit-distance method, while
the blue curve shows the results obtained with Cross and Hancock's original
algorithm. In each figure the left-hand panel shows the final alignment error,
expressed in pixels, as a function of the average positional deviation. Suppose
that $T \subseteq \mathcal{M} \times \mathcal{D}$ is the set of ground-truth correspondences between the
uncorrupted portion of the data-graph and the model. If $n_e$ is the final iteration
number for the matching algorithm, then the measure of registration accuracy is

$$\epsilon = \frac{\sum_{(i,j) \in T} \left\| \vec{w}_j - \Phi^{(n_e)} \vec{z}_i \right\|}{|T|}$$

The right-hand panel shows the final fraction of correct correspondences.

We have subjected the data point-set to two types of error. Firstly, we have
added varying amounts of Gaussian error to the positions of the points. Here
the aim is to simulate the effect of point measurement or localisation error.
The second type of error simulates the effects of poor feature detection. This
structural error has been generated by randomly adding and deleting points in the
data graph and subsequently re-triangulating the point-set.
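The two corruption processes can be sketched as follows. This is only an illustration under stated assumptions: the number of added points equals the number of deleted points, new points are drawn uniformly in the window, and the re-triangulation step is omitted; the function names and the parameter values are hypothetical.

```python
import random


def add_gaussian_noise(points, sigma, rng):
    """Perturb every point position with zero-mean Gaussian error."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in points]


def corrupt_structure(points, fraction, window, rng):
    """Delete a fraction of the points and add as many random new ones.

    Assumption: additions and deletions are balanced, so the size of the
    point-set is unchanged. Re-triangulation would follow this step.
    """
    k = round(fraction * len(points))
    kept = rng.sample(points, len(points) - k)
    added = [(rng.uniform(0, window), rng.uniform(0, window)) for _ in range(k)]
    return kept + added


rng = random.Random(0)
# 20 nodes distributed at random in a 200 x 200 window, as in the experiments.
points = [(rng.uniform(0, 200), rng.uniform(0, 200)) for _ in range(20)]
noisy = add_gaussian_noise(points, sigma=2.0, rng=rng)
corrupted = corrupt_structure(points, fraction=0.25, window=200, rng=rng)
```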

In Figure 2 we show the effect of varying the amount of Gaussian measurement
error. In panel (a) the green curve is the initial registration error. The
edit-distance method consistently outperforms the padded dictionary method.
When the initial registration error is small, the padded dictionary can actually
lead to a deterioration in the alignment.

[Figure: (a) Registration accuracy and (b) Structural matching results, both plotted against the average fraction of positional corruption.]

Fig. 2. Effect of positional Gaussian noise.

[Figure: (a) Registration accuracy and (b) Structural matching results, both plotted against the fraction of structural corruption.]

Fig. 3. Effect of relational disruption in the data graph.



In Figure 3 we investigate the effect of structural error. Here, both panels
show that the new approach is less sensitive to structural corruption. This is
because the edit distance-based method tolerates any size difference between
matching supercliques, whereas the performance of the padded dictionary method
rapidly decreases with increasing size difference.

Next we turn our attention to the effects of affine distortion of the point-sets.
Here we measure the effects of rotation; the results are shown in Figure 4.
We have tested the effectiveness of the two methods by progressively
rotating the data image about its geometric centre. The operating
limit for the old method has been estimated as ±25°, while for the edit
distance-based method it is ±35°. Panel (a) again shows that the new method is less
sensitive to noise than the original one.

[Figure: (a) Registration accuracy and (b) Structural matching results.]

Fig. 4. Sensitivity to affine rotation.



Taken together, these results would suggest that the edit-distance method 
is not only faster than the padded dictionary method, it is also more accurate. 
It must be stressed however, that the original method of Cross and Hancock 
employs two additional refinements not used here. Firstly, it uses an edit process 
to remove poorly matching nodes. Secondly, it anneals the constant Pe with 
iteration number. This will reduce problems associated with convergence to a 
local optimum. 

6.2 Real World Imagery 

This example demonstrates the effectiveness of the dual-step matching scheme
on real world images. We simulate the task of recognizing planar objects in
different 2D poses, using two different images of a 3.5-inch floppy
disk.

Panel (a) of Figure 5 shows the model image on which we have superimposed
the feature points (corners extracted by hand) and the corresponding Delaunay
triangulation. As we can see from panel (b), which shows the data object,
our experiment involves two components of affine
transformation at the same time: skewing and scaling.

The sequence in Figure 6 shows the iterative recovery of the affine geometry.
Here, we illustrate the iterative registration of the model object against the data
image. As in the sensitivity analysis, the initial affine alignment matrix was
the identity matrix and the initial correspondences were all incorrect. The first
panel of Figure 6 shows the initial situation in the registration process. Each
figure in the sequence has been obtained by superimposing the successive model
image transformations on the data image. The last panel shows that the process
converges after a few iterations; it is clear that the recovered transformation is
very accurate even though the initial conditions were unfavourable.

[Figure: (a) Model Object; (b) Data Object.]

Fig. 5. The two different views used in the matching experiments.

[Figure: panels (a) to (d).]

Fig. 6. The iterative registration of the model; the steps are ordered from left-to-right
and top-to-bottom.

7 Conclusions 

In this paper we have shown that the use of edit-distance can improve both the
efficiency and accuracy of point-set matching using the Cross and Hancock
dual-step EM algorithm. Based on these promising results, we intend to extend our
work by incorporating several refinements reported in the original work of Cross
and Hancock. These include annealing and graph-editing.



Abstract. This paper presents a new formalism for irregular pyramids
based on combinatorial maps. Such a pyramid consists of a stack of successively
reduced graphs. Each smaller graph is deduced from the preceding
one by a set of edges which have to be contracted or removed. In order
to perform parallel contractions or removals, the set of edges to be
contracted or removed has to satisfy some properties. Such a set of edges
is called a Decimation Parameter. A combinatorial map encodes a
planar graph by means of two permutations encoding the edges and their
orientation around the vertices. Combining the useful properties of both
combinatorial maps and irregular pyramids offers a potential alternative
for representing structures at multiple levels of abstraction.



1 Introduction 

The multi-level representation of an image called a pyramid allows us to
define a hierarchy between the different levels of representation of the same object. The
method was introduced by Pavlidis, who uses a pyramid to define several
partitions of the same image. Each connected component defined at one level is
linked with its decomposition in the next level. His method defines a hierarchy
between different partitions of the same image and is thus well suited to
segmentation, where the definition of a region depends on the semantic
context. For example, a face in an image may be considered as one region, or as
the union of several regions defining the different semantic parts of the face such
as the eyes and the hair. Pyramids allow us to define a hierarchy between
these two representations of a face, which represent the same object at different
levels. The first implementation of pyramids used a regular tessellation of the
image into a set of squares describing a balanced quadtree. Such a representation,
called a regular pyramid, restricts the way in which objects defined at a given level
are linked to their father in the next level. This restriction was relaxed by
Meer, Jolion and Montanvert [7], who introduced a new family of pyramids
called Irregular Pyramids. Kropatsch has shown that irregular pyramids based
on a pair of dual graphs may encode any partition of an image.

There are at least two ways to represent plane graphs: a pair of dual
graphs or combinatorial maps. In analogy to regular image pyramids,
dual graph contraction has been used to build irregular graph pyramids
with the aim of preserving the high efficiency of their regular ancestors while
gaining further flexibility to adapt their structure to the data. Experience with
connected component analysis, with universal segmentation, and with
topological analysis of line drawings shows the great potential of this concept.

We have shown that an encoding of a planar map by a pair of dual
graphs with an implicit encoding of the orientation may be converted into a
combinatorial map encoding and conversely. Therefore, any object which may
be described by 2D combinatorial maps may also be described by dual graphs.
However, combinatorial maps present several advantages over dual graphs
which justify their use within the irregular pyramid framework:

- Combinatorial maps allow a graph and its dual to be encoded within a single
formalism. Moreover, the implementation also uses a single object to encode
both graphs.

- The encoding of the orientation of the plane, which is implicit within the
dual graph formalism, is explicit in the combinatorial map one.

- The combinatorial map formalism may be easily extended to higher
dimensions. Thus, using combinatorial map pyramids, the same formalism
may be used to define 2D, 3D or 4D Irregular Pyramids.

The rest of the paper is organized as follows: in Sect. 2 we give some
definitions and basic properties of combinatorial maps together with the definition
of the contraction and removal operations. In Sect. 3 we define the notion of
decimation parameter, which allows us to perform several contractions
simultaneously. Finally, we give some perspectives opened by our work in Sect. 4.



2 Definition and Properties of Combinatorial Maps 

A combinatorial map may be deduced from a planar graph by splitting each
edge into two half edges called darts (see Fig. 1). The relation between two darts
$d_1$ and $d_2$ associated to the same edge is encoded by the permutation $\alpha$, which
maps $d_1$ to $d_2$ and vice versa. The permutation $\alpha$ is thus an involution and its
orbits are denoted by $\alpha^*(d)$, for a given dart $d$. These orbits encode the edges of
the graph. Moreover, each dart is associated to a unique vertex. The sequence of
darts encountered when turning around a vertex is encoded by the permutation
$\sigma$. Using a counter-clockwise orientation, the orbit $\sigma^*(d)$ encodes the set of darts
encountered when turning counter-clockwise around the vertex associated to the
dart $d$. A combinatorial map can thus be formally defined by:
Definition 1. Combinatorial map

A combinatorial map $G$ is the triplet $G = (\mathcal{D}, \sigma, \alpha)$, where $\mathcal{D}$ is a set called
the set of darts and $\sigma$, $\alpha$ are two permutations defined on $\mathcal{D}$ such that $\alpha$ is an
involution:

$$\forall d \in \mathcal{D} \quad \alpha \circ \alpha(d) = d$$

Given a dart $d$, $\sigma(d)$ and $\alpha(d)$ will be respectively called the $\sigma$- and $\alpha$-successors
of $d$.




$$\forall d \in \{-6, \ldots, 6\} \quad \alpha(d) = -d$$

Fig. 1. Each dart of the combinatorial map is encoded by an integer. The permutation
$\alpha$ associates its opposite to each dart. The permutation $\sigma$ (see (a)) is encoded by an
array of integers (see (b)).



Note that, if the darts are encoded by positive and negative integers (see
Fig. 1), the involution $\alpha$ may be implicitly encoded by the sign:

$$\forall d \in \mathcal{D} \quad \alpha(d) = -d$$

This convention is often used for practical implementations of connected
combinatorial maps, where the combinatorial map is simply implemented by an
array of integers encoding the permutation $\sigma$ (see Fig. 1(b)).
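The sign convention above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the map is a hypothetical triangle with three edges (not the map of Fig. 1), and the names `sigma_array`, `sigma`, `alpha` and `orbit` are chosen here for the example.

```python
# Hypothetical example: a triangle with edges 1, 2, 3 (NOT the map of Fig. 1).
# Dart +i and dart -i are the two halves of edge i.
N_EDGES = 3


def alpha(d):
    """Involution alpha, encoded implicitly by the sign of the dart."""
    return -d


# sigma is stored as a flat array of integers: dart d lives at index d + N_EDGES.
# Index N_EDGES (d = 0) is unused because 0 is not a dart.
SIGMA_OF = {1: -3, -3: 1, -1: 2, 2: -1, -2: 3, 3: -2}
sigma_array = [0] * (2 * N_EDGES + 1)
for dart, successor in SIGMA_OF.items():
    sigma_array[dart + N_EDGES] = successor


def sigma(d):
    """Next dart met when turning around the vertex that owns d."""
    return sigma_array[d + N_EDGES]


def orbit(perm, d):
    """Darts reached from d by repeatedly applying the permutation perm."""
    darts, e = [d], perm(d)
    while e != d:
        darts.append(e)
        e = perm(e)
    return darts
```

With this layout, `orbit(alpha, d)` lists the two darts of an edge and `orbit(sigma, d)` lists the darts around a vertex, so no separate storage is needed for $\alpha$.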

The permutations $\sigma$ and $\alpha$ allow us to pass from one dart to another within
the same connected component. Given a dart $d$, the set of darts of its connected
component is denoted by $\mathcal{G}.d$, where $\mathcal{G}$ is the group of permutations generated
by $\sigma$ and $\alpha$:

Definition 2. Group associated to a combinatorial map

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$, the associated group $\mathcal{G}$ of $G$ is the
subgroup generated by $\sigma$ and $\alpha$ within the symmetric group of all permutations
on $\mathcal{D}$.



It is clear that two different labellings of the darts give the same graph.
This notion of proximity between combinatorial maps has been formalized by
Gareth:



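Definition 3. Morphism between combinatorial maps

Given two combinatorial maps $G_1 = (\mathcal{D}_1, \sigma_1, \alpha_1)$, $G_2 = (\mathcal{D}_2, \sigma_2, \alpha_2)$ and
their associated groups $\mathcal{G}_1$ and $\mathcal{G}_2$, a morphism $\phi : G_1 \to G_2$ is a pair of
functions $(\chi, \psi)$, $\chi : \mathcal{G}_1 \to \mathcal{G}_2$ and $\psi : \mathcal{D}_1 \to \mathcal{D}_2$, where $\chi$ is a group homomorphism
such that:

$$\chi(\alpha_1) = \alpha_2, \qquad \chi(\sigma_1) = \sigma_2$$

and $\psi$ respects the orientation:

$$\forall d \in \mathcal{D}_1 \quad \psi(\alpha_1(d)) = \alpha_2(\psi(d)), \quad \psi(\sigma_1(d)) = \sigma_2(\psi(d)) \qquad (1)$$

If $\psi$ is bijective, $\phi$ will be called an isomorphism.

Note that, if $(\chi, \psi)$ is an isomorphism, (1) may be rewritten as:

$$\forall d \in \mathcal{D}_1 \quad \alpha_1(d) = \psi^{-1}(\alpha_2(\psi(d))), \quad \sigma_1(d) = \psi^{-1}(\sigma_2(\psi(d))) \qquad (2)$$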



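For example, if $\mathcal{D}_1$ and $\mathcal{D}_2$ are two sets of darts, and if $\pi$ is a bijective
application from $\mathcal{D}_1$ to $\mathcal{D}_2$, we can show that the two combinatorial maps
$G_1 = (\mathcal{D}_1, \sigma, \alpha)$ and $G_2 = (\mathcal{D}_2, \pi \circ \sigma \circ \pi^{-1}, \pi \circ \alpha \circ \pi^{-1})$, where $\sigma$ and $\alpha$ are defined on
$\mathcal{D}_1$, are isomorphic.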

Using the combinatorial map formalism, the dual of a combinatorial map
$G = (\mathcal{D}, \sigma, \alpha)$ is deduced from $G$ by using $\varphi = \sigma \circ \alpha$ instead of $\sigma$:

Definition 4. Dual combinatorial map

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$, the combinatorial map
$\overline{G} = (\mathcal{D}, \varphi, \alpha)$ is called the dual of $G$. The permutation $\varphi$ is defined by:

$$\varphi = \sigma \circ \alpha$$

The orbits of $\varphi$ encode the faces of $G$.

The simplicity of the dual transformation avoids encoding explicitly the dual 
combinatorial map. Therefore, each operation performed on a combinatorial map 
will also modify its dual. This last point allows us to reduce the required memory 
and the complexity of our algorithms. 
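As a small illustration of why the dual need not be stored, the sketch below derives $\varphi = \sigma \circ \alpha$ on the fly and lists the orbits of $\sigma$ (vertices) and of $\varphi$ (faces). It assumes the sign convention $\alpha(d) = -d$ and uses a hypothetical triangle map; the function names are illustrative only.

```python
def dual(sigma):
    """Return phi = sigma o alpha as a dict, assuming alpha(d) = -d."""
    return {d: sigma[-d] for d in sigma}


def orbits(perm):
    """List the cycles of a permutation given as a dict dart -> dart."""
    seen, result = set(), []
    for d in sorted(perm):
        if d in seen:
            continue
        cycle, e = [d], perm[d]
        while e != d:
            cycle.append(e)
            e = perm[e]
        seen.update(cycle)
        result.append(cycle)
    return result


# Hypothetical triangle: darts +i and -i are the two halves of edge i.
sigma = {1: -3, -3: 1, -1: 2, 2: -1, -2: 3, 3: -2}
phi = dual(sigma)

vertices = orbits(sigma)   # one cycle per vertex
faces = orbits(phi)        # one cycle per face
n_edges = len(sigma) // 2  # each edge owns two darts
```

For this connected planar example the counts satisfy Euler's relation (3 vertices, 3 edges, 2 faces), and applying `dual` twice gives back `sigma` because $\alpha$ is an involution.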

The removal operation (see Definition 5) is often used to simplify a graph.
Using irregular pyramids, this operation may be performed to build the different
levels of the pyramid from an initial graph.

Definition 5. Removal Operation

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$ and a dart $d \in \mathcal{D}$. If $\alpha^*(d)$ is not
a bridge, the combinatorial map $G' = (\mathcal{D}', \sigma', \alpha) = G \setminus \alpha^*(d)$ is defined by:

- $\mathcal{D}' = \mathcal{D} \setminus \alpha^*(d)$ and

- $\sigma'$ is deduced from $\sigma$ by:

$$\forall d' \in \mathcal{D}' \quad \sigma'(d') = \sigma^n(d') \quad \text{with} \quad n = \min\{p \in \mathbb{N}^* \mid \sigma^p(d') \notin \alpha^*(d)\}$$

Note that the bridges are excluded from Definition 5 in order to keep the number
of connected components of the combinatorial map unchanged.

If combinatorial maps are used to encode a partition, the merge of two regions 
may be seen in two ways: first, it may be performed by removing one of the com- 
mon boundary segments between the two regions. This operation is performed 
by the removal operation. In this case, each vertex of the combinatorial map is 
associated to the intersection of at least three boundaries (see Fig. P). Secondly, 
the merge of the two regions may be performed by identifying the two regions 
and removing one of the edges encoding their adjacency. In this case, each vertex 
of the combinatorial map is associated to one region. This dual point of view on
the merging of regions is implemented by the contraction operation (see Definition 6),
which may be considered as the dual of the removal operation.

Definition 6. Contraction operation

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$ and one dart $d$ in $\mathcal{D}$ which is not
a self-loop. The contraction of dart $d$ creates the graph $G/\alpha^*(d)$ defined by:

$$G/\alpha^*(d) = \overline{\overline{G} \setminus \alpha^*(d)}$$

The expressions of Definitions 5 and 6 in terms of modifications of the
permutation $\sigma$ are given elsewhere.

Note that the contraction operation is well defined since $d$ is a self-loop in $G$
iff it is a bridge in $\overline{G}$. Thus, any sequence of removal or contraction operations
will preserve the number of connected components of the initial graph. This last
property is useful in the irregular pyramid framework, which attempts to simplify
the initial planar map while preserving its essential structural properties such as
the number of connected components.
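The two operations can be sketched as follows, assuming the sign convention $\alpha(d) = -d$ and a permutation stored as a Python dict. The helper names are hypothetical, and the sketch does not check that the edge is not a bridge (for removal) or not a self-loop (for contraction); the caller is assumed to guarantee this.

```python
def dual(sigma):
    """phi = sigma o alpha with alpha(d) = -d; dual(dual(sigma)) == sigma."""
    return {d: sigma[-d] for d in sigma}


def remove_edge(sigma, d):
    """Removal: drop the two darts of edge alpha*(d) and let each remaining
    dart skip over them, i.e. sigma'(b) = sigma^n(b) for the smallest n >= 1
    such that sigma^n(b) is not a removed dart."""
    removed = {d, -d}
    new_sigma = {}
    for b in sigma:
        if b in removed:
            continue
        nxt = sigma[b]
        while nxt in removed:
            nxt = sigma[nxt]
        new_sigma[b] = nxt
    return new_sigma


def contract_edge(sigma, d):
    """Contraction: removal of the same edge in the dual map, then dual again."""
    return dual(remove_edge(dual(sigma), d))


# Hypothetical triangle A-B-C: edge 1 = (A,B), edge 2 = (B,C), edge 3 = (C,A).
triangle = {1: -3, -3: 1, -1: 2, 2: -1, -2: 3, 3: -2}

path = remove_edge(triangle, 2)        # B and C become pendant vertices
two_vertices = contract_edge(triangle, 1)  # A and B are merged
```

Removing edge 2 leaves the path B-A-C, while contracting edge 1 merges A and B into one vertex that is joined to C by the two parallel edges 2 and 3.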

3 Decimation Parameters 

In order to perform more than one contraction simultaneously, we have to
ensure that the resulting combinatorial map is independent of the order of the
contractions. We have shown elsewhere that if the contracted combinatorial maps
remain connected, the contraction or the removal of any two darts is independent
of the order of the operations.

Thus, if the contraction operations are allowed in a given order, they can be
performed in any order. The contraction operations are well defined if we do not
contract a self-loop. Using a Decimation Parameter (see Definition 8), the set of
darts to be contracted forms an Independent Vertex Set (see Fig. 2):
Definition 7. Independent Vertex Set

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$, a set of darts $\mathcal{D}' \subset \mathcal{D}$ will be
called an Independent Vertex Set iff:

$$\alpha(\sigma^*(\mathcal{D}')) \cap \sigma^*(\mathcal{D}') = \emptyset$$

The set $\mathcal{D}'$ will be called a maximum Independent Vertex Set iff:

$$\forall d \in \mathcal{D} - \sigma^*(\mathcal{D}') \quad \exists d' \in \sigma^*(d) \mid \alpha(d') \in \sigma^*(\mathcal{D}')$$

All vertices are defined by one dart of $\mathcal{D}'$ or are linked to one vertex defined by
a dart in $\mathcal{D}'$.



Intuitively, the definition of an Independent Vertex Set consists in selecting a
set of vertices, called the set of surviving vertices, and a set of edges such that the
surviving vertices are not connected in the induced sub combinatorial map.
The selected edges then connect a surviving vertex to a non surviving one. These
edges become the edges to be contracted in the Decimation Parameter definition:

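Definition 8. Decimation Parameter

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$, a Decimation Parameter is a set
of darts $\mathcal{D}'$ such that $\mathcal{D}'$ is an Independent Vertex Set of $G$ and:

$$\forall d \in \mathcal{D} - \sigma^*(\mathcal{D}') \quad \exists!\, d' \in \mathcal{D}' \mid \alpha(d') \in \sigma^*(d)$$

The set $\mathcal{D} - \alpha^*(\mathcal{D}')$ is the set of surviving darts and is denoted $S\mathcal{D}$.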

The definition of a Decimation Parameter ensures that the edges between
surviving vertices are not contracted and that exactly one of the multiple edges
incident to a non surviving vertex is contracted. For example, the maximum
Independent Vertex Set displayed in Fig. 2(a) does not satisfy the requirements
of a Decimation Parameter since it contains a non-surviving vertex connected
to two edges to be contracted. Fig. 2(b) satisfies the requirements of both an
Independent Vertex Set and a Decimation Parameter. Thus, using the definition
of a Decimation Parameter, no self-loops can be contracted and the contraction
operations can be performed simultaneously.
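The two conditions can be checked mechanically. The sketch below is one possible reading of Definitions 7 and 8, assuming $\alpha(d) = -d$ and $\sigma$ stored as a dict; the function names and the triangle example are hypothetical and are not taken from Fig. 2.

```python
def orbit(perm, d):
    """Darts reached from d by iterating the permutation perm (a dict)."""
    darts, e = [d], perm[d]
    while e != d:
        darts.append(e)
        e = perm[e]
    return darts


def sigma_star(sigma, darts):
    """Union of the sigma-orbits (vertices) of a set of darts."""
    out = set()
    for d in darts:
        out.update(orbit(sigma, d))
    return out


def is_independent_vertex_set(sigma, d_prime):
    """alpha(sigma*(D')) and sigma*(D') must be disjoint (alpha(d) = -d)."""
    surviving = sigma_star(sigma, d_prime)
    return not ({-d for d in surviving} & surviving)


def is_decimation_parameter(sigma, d_prime):
    """Each non surviving vertex must be hit by exactly one dart of D'."""
    if not is_independent_vertex_set(sigma, d_prime):
        return False
    surviving = sigma_star(sigma, d_prime)
    for d in sigma:
        if d in surviving:
            continue
        vertex = set(orbit(sigma, d))
        hits = [c for c in d_prime if -c in vertex]
        if len(hits) != 1:
            return False
    return True


# Hypothetical triangle: vertex A owns darts {1, -3}, B owns {-1, 2}, C owns {-2, 3}.
triangle = {1: -3, -3: 1, -1: 2, 2: -1, -2: 3, 3: -2}
```

On this triangle, `{1}` is an Independent Vertex Set but not a Decimation Parameter (vertex C is reached by no dart of the set), whereas `{1, -3}` satisfies both definitions.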

Definition 9 defines a class of elementary paths called connecting paths which
connect two surviving vertices. According to Definition 0 these paths contain 
exactly one dart which is not contracted. Therefore, the connecting paths may 
be denoted by CP{d) where d is the surviving dart of the path. 



Definition 9. Connecting Path

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$, a Decimation Parameter $\mathcal{D}'$ of $G$
and two darts $b_1$ and $b_2$ in $\sigma^*(\mathcal{D}')$, $CP(b_1, b_2)$ will be called a connecting path iff
it is a path and if it verifies one of the following conditions:

1. The vertices $\sigma^*(b_1)$ and $\sigma^*(b_2)$ are adjacent:

$$CP(b_1, b_2) = d \in S\mathcal{D} = \mathcal{D} - \alpha^*(\mathcal{D}')$$

In this case the dart $d$ is a surviving dart.

2. The vertices $\sigma^*(b_1)$ and $\sigma^*(b_2)$ are separated by one non surviving vertex:

$$CP(b_1, b_2) = d_1 d_2 \quad \text{with} \quad |\{d_1, d_2\} \cap S\mathcal{D}| = 1$$

In this case the non surviving vertex will be removed by the contraction of $d_1$
or $d_2$. Therefore, one of these darts must survive and the other be contracted.

3. The vertices $\sigma^*(b_1)$ and $\sigma^*(b_2)$ are separated by two non surviving vertices:

$$CP(b_1, b_2) = d_1 d_2 d_3 \quad \text{with} \quad |\{d_1, d_2, d_3\} \cap S\mathcal{D}| = 1$$

In this case the two non surviving vertices will be removed by the contraction
of $d_1$ and $d_3$. The dart $d_2$ linking two non surviving vertices cannot be
contracted and is thus a surviving dart.

Fig. 2. A maximal Independent Vertex Set (a) and a Decimation Parameter (b). The
set of contracted edges $\mathcal{D}'$ is represented by black arrows. The surviving vertices are
represented in black and have at least one dart in $\mathcal{D}'$.

The set of surviving darts $S\mathcal{D}$ being symmetric by $\alpha$, $CP(d)$ and $CP(\alpha(d))$
are defined simultaneously for any dart $d$ in $S\mathcal{D}$. We can therefore define the
$\alpha$-successor of a connecting path $CP(d)$, denoted $\alpha_C(CP(d))$, as $CP(\alpha(d))$:

Definition 10. $\alpha$-Successor of Connecting Paths

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$ and a Decimation Parameter
$\mathcal{D}'$ of $G$, we define the involution $\alpha_C$ which associates to each connecting path
containing one dart $d$ of $S\mathcal{D}$ the connecting path which contains $\alpha(d)$:

$$\forall d \in S\mathcal{D} \quad \alpha_C(CP(d)) = CP(\alpha(d))$$
We have shown elsewhere that, given a dart $d$ in $\mathcal{D}$, the sequence of darts
$d, \varphi(d), \varphi^2(d)$ contains at least one non-contracted dart in $S\mathcal{D}$. The first dart
of this sequence which is not contracted is called the representative dart $Rep(d)$
of $d$:

Definition 11. Representative Dart

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$ and a Decimation Parameter $\mathcal{D}'$,
the representative dart $Rep(d)$ of any dart $d$ in $\mathcal{D}$ is defined by:

$$Rep(d) = \varphi^i(d) \quad \text{with} \quad i = \min\{j \in \{0, 1, 2\} \mid \varphi^j(d) \in S\mathcal{D}\}$$

and is a non contracted dart belonging to $S\mathcal{D}$.



Using the function $Rep$, we can define the $\sigma$-successor of each connecting path
by:

Proposition 1. Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$ and a Decimation
Parameter $\mathcal{D}'$, the application:

$$\sigma_C(CP(d)) = CP(Rep(\sigma(d)))$$

defines a permutation on the set of connecting paths.

The proof of this proposition may be found elsewhere.

If we denote by $\mathcal{D}_C$ the set of connecting paths defined by a Decimation
Parameter $\mathcal{D}'$, the involution $\alpha_C$ and the permutation $\sigma_C$ define a combinatorial
map on the set of connecting paths:

Definition 12. Connecting Path Map

Given a combinatorial map $G = (\mathcal{D}, \sigma, \alpha)$ and a Decimation Parameter $\mathcal{D}'$,
the set of connecting paths $\mathcal{D}_C$ may be defined by:

$$\mathcal{D}_C = \{CP(d),\ d \in S\mathcal{D}\}$$

The map of connecting paths $G_C$ associated to the Decimation Parameter is
defined by:

$$G_C = (\mathcal{D}_C, \sigma_C, \alpha_C)$$

(see Definition 10 and Proposition 1).

Since each connecting path is uniquely defined by one dart in $S\mathcal{D}$, we can
consider $CP$ as a bijective application which associates to each dart in $S\mathcal{D}$ its
associated connecting path. Then, if $G' = (S\mathcal{D}, \sigma', \alpha')$ denotes the contracted
combinatorial map $G/\alpha^*(\mathcal{D}')$, and if $\chi$ denotes the application:




where $\mathcal{G}'$ and $\mathcal{G}_C$ respectively denote the groups of permutations associated to
$G'$ and $G_C$ (see Definition 2).
The application $(\chi, CP)$ is an isomorphism between the contracted map $G'$
and the connecting path map $G_C$ (a proof is given elsewhere). Then, the permutations $\alpha'$
and $\sigma'$ may be respectively deduced from the permutations $\alpha_C$ and $\sigma_C$ as follows
(see (2)):

$$\forall d \in \mathcal{D} - \alpha^*(\mathcal{D}') \quad \begin{cases} \alpha'(d) = CP^{-1}(\alpha_C(CP(d))) = CP^{-1}(CP(\alpha(d))) = \alpha(d) \\ \sigma'(d) = CP^{-1}(\sigma_C(CP(d))) = CP^{-1}(CP(Rep(\sigma(d)))) = Rep(\sigma(d)) \end{cases}$$

Therefore, the contracted combinatorial map $G' = G/\alpha^*(\mathcal{D}')$ may be
constructed from $G$ by leaving the permutation $\alpha$ unchanged and by computing for
each surviving dart the value $Rep(\sigma(d))$, i.e. by searching for the minimal integer $j$
in $\{0, 1, 2\}$ such that $\varphi^j(\sigma(d))$ belongs to $S\mathcal{D}$. If this operation is performed in
parallel on each surviving dart, the contracted combinatorial map may be built
in constant time.
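The construction of the contracted map from $Rep(\sigma(d))$ can be sketched as below. The sketch assumes $\alpha(d) = -d$, $\varphi = \sigma \circ \alpha$, and that the given set of darts really is a Decimation Parameter; the function name `contract` and the triangle example are illustrative. The loop is sequential here, but each surviving dart is processed independently, which is what makes a parallel evaluation possible.

```python
def contract(sigma, d_prime):
    """Return sigma' on the surviving darts, with sigma'(d) = Rep(sigma(d)).

    sigma   : dict dart -> dart (alpha is the sign change, alpha(d) = -d)
    d_prime : set of darts assumed to be a valid Decimation Parameter
    """
    contracted = set(d_prime) | {-d for d in d_prime}   # alpha*(D')
    surviving = [d for d in sigma if d not in contracted]
    phi = {d: sigma[-d] for d in sigma}                 # phi = sigma o alpha

    def rep(d):
        # First non contracted dart among d, phi(d), phi^2(d).
        for _ in range(3):
            if d not in contracted:
                return d
            d = phi[d]
        raise ValueError("d_prime is not a valid Decimation Parameter")

    # Every entry depends only on sigma and phi, never on other entries.
    return {d: rep(sigma[d]) for d in surviving}


# Hypothetical triangle A-B-C with surviving vertex A and D' = {1, -3}:
# edges 1 and 3 are contracted, edge 2 survives.
triangle = {1: -3, -3: 1, -1: 2, 2: -1, -2: 3, 3: -2}
contracted_map = contract(triangle, {1, -3})
```

In this example B and C are merged into A, so the surviving edge 2 becomes a self-loop on the single remaining vertex.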



4 Conclusion and Perspectives 

We have defined in this article the theoretical framework needed to perform
removal or contraction operations on combinatorial maps. The contraction operation
is then generalized thanks to the definition of a Decimation Parameter. These
definitions allow us to perform several contractions in parallel.

The definition of a contraction kernel by labeled pyramids is under
development. This expected result, together with those summarized in this article, should
allow us to study interesting applications of our model such as segmentation,
efficient structural matching or integration of moving objects. Finally,
the extension of our model to higher dimensional spaces (3D) should be studied.
Abstract: In this paper a new distance for attributed relational graphs is
proposed. The main idea of the new algorithm is to decompose the graphs to be
matched into smaller subgraphs. The matching process is then done at the level
of the decomposed subgraphs based on the concept of error-correcting
transformations. The distance between two graphs is found to be the minimum
of a weighted bipartite graph constructed from the decomposed subgraphs. The
average computational complexity of the proposed distance is found to be
$O(N^4)$, which is much better than many techniques.



1- Introduction 

The search for general structural mathematical models has led workers in the field 
of pattern recognition to study graphs, for these can be of direct use in describing 
relations between the elements of a set of objects. Further, graph theory methods can 
be used in a wide variety of problems and for this reason much study has been given to 
the mathematical and algorithmic properties of graphs. 

Attributed relational graphs (ARGs) are one of the most powerful tools in describing
structured objects. In this representation, nodes represent primitives or subpatterns of 
structured objects and branches between nodes represent relations between primitives 
or subpatterns [1]. 

One way to recognize the structure of an unknown pattern is to transform it into an 
ARG, then match it with other ARGs representing structures of prototype patterns. 
This process of matching is called graph isomorphism. Formally, two graphs G and G’ 
are said to be isomorphic (to each other) if there is a one-to-one correspondence 
between their vertices and between their edges such that incidence relationship is 
preserved [2]. If the isomorphism is encountered between a graph and a subgraph of 
another larger graph, then the problem is called subgraph isomorphism or graph 
monomorphism. The largest common subgraphs problem is to find an isomorphic
mapping between subgraphs of G and subgraphs of G’.

Graph isomorphism has been widely used as a powerful tool for matching and
recognizing structured objects, using techniques such as inexact graph matching
[4], relaxation methods [5], Cartesian graph product [6,17], error-correcting
transformations [7,14], neural networks [8,18], graduated assignment [9] and direct
classification of node attendance [10]. Trials for matching weighted graphs were
shown in [11-13]. Some of the applications were demonstrated in [19, 27].

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 266-276, 2000.
© Springer-Verlag Berlin Heidelberg 2000

A New Error-Correcting Distance for Attributed Relational Graph Problems

The main problem with using graph isomorphism as a tool for graph matching is
that it is only applicable when the matched graphs share a common structure;
graph isomorphism cannot be used when matching graphs with different structures
[21]. In this case, a measure of distance between graphs is needed. Several
contributions have introduced efficient distances between graphs, as in
[1, 22-30], using three general methods based on 1- feature extraction [20], 2- graph
grammar [22-24], and 3- error-correcting transformations [1, 25-30]. The main
problems of these distance measures are their complexity, which may grow
exponentially with the sizes of the matched graphs, and their deficiency in
handling graph isomorphism problems.

In this paper a distance measure between attributed relational graphs is introduced. 
The proposed distance can be efficiently used for determining the isomorphism 
between matched graphs. The basic idea of the proposed algorithm is to decompose 
the matched graphs into smaller subgraphs and perform the matching between the 
graphs at the level of their decomposed subgraphs based on the concept of error- 
correcting transformations. 

The process of graph decomposition, and how to match the decomposed 
subgraphs are shown in section 2. Section 3 introduces the proposed algorithm for 
calculating the distance between matched graphs with analysis of its computational 
complexity. Experimental results are presented in section 4, and finally the 
conclusions of the proposed algorithm are given in section 5. 

2- Graph Decomposition 

In this section the process of graph decomposition into smaller subgraphs is 
introduced, followed by proposing the matching algorithm of these decomposed 
subgraphs. 

2-1- Decomposition of Attributed Relational Graphs. 

Simplifying the structure of the matched graphs reduces the overall
complexity of the algorithm and thus improves its performance.

The graphs resulting from decomposing an ARG are called Basic Attributed 
Relational Graphs (BARGs) and this notion is adopted from [27]. The BARG is in the 
form of one level tree which consists of a node and all nodes connected to that node 
whether the connected branches are in or out of that node. The structure of these 
BARGs gives the matching process more stability, increases the data associated with 
decomposition process, facilitates the matching process, and preserves the structure of 
the original graph. 
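A minimal sketch of this decomposition is given below. The data layout (a dict of node attributes and a dict of directed branches) and the function name `decompose_into_bargs` are assumptions made for illustration; the source does not prescribe a representation.

```python
def decompose_into_bargs(nodes, branches):
    """Split an ARG into one-level trees (BARGs), one per node.

    nodes    : dict node_id -> node attribute
    branches : dict (src, dst) -> branch attribute (directed branches)

    Returns a dict root_id -> list of (neighbour_id, branch_attribute),
    where a neighbour is any node joined to the root by a branch going
    in or out of the root.
    """
    bargs = {root: [] for root in nodes}
    for (src, dst), attr in branches.items():
        bargs[src].append((dst, attr))   # branch leaving src
        bargs[dst].append((src, attr))   # same branch seen from dst
    return bargs


# Hypothetical three-node ARG with numeric attributes.
nodes = {"a": 1.0, "b": 2.0, "c": 3.0}
branches = {("a", "b"): 0.5, ("c", "a"): 0.7}
bargs = decompose_into_bargs(nodes, branches)
```

Each node becomes the root of exactly one BARG, and every branch appears in the BARGs of both of its end nodes, so the structure of the original graph can be recovered from the set of BARGs.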

2-2- Matching BARGs 

Matching BARGs is much easier than matching ARGs because the
structure of BARGs is simpler and easier to process than that of ARGs. The same distance
measures for ARGs can also be adopted for BARGs. In this paper, we use the concept
of error-correcting transformations, where the cost of matching two ARGs is defined
as the cost of the sequence of transformations that possesses minimum total cost and
that must be performed on one of the two ARGs in order to produce the other ARG.
The operations permitted to transform one ARG into the other are: node insertion, node
deletion, branch insertion, branch deletion, node label substitution and branch label
substitution. A cost (weight) is associated with each operation and its value is
determined by an optimization procedure or heuristically. The costs corresponding to
each operation are $w_{ni}$, $w_{nd}$, $w_{bi}$, $w_{bd}$, $w_{ns}$ and $w_{bs}$ respectively. Some research in
this area can be found in [1, 7, 14, 25-28].

A new operation, structure preservation, is added here with cost $w_{sp}$. The main
function of the new operation is to help preserve the global structure of the original
ARG after performing the operation of graph decomposition; its use is
explained later in this section.

Given two BARGs, say U and V as shown in Fig. 1, the cost of matching U and V
is calculated as follows:

Fig. 2. Corresponding weighted bipartite graph.

$$\mathrm{Distance}(U, V) = w_{ns} \cdot \mathrm{dist}(u_i, v_j) + \min(w_{ni}, w_{nd}) \cdot |k - p| + \min(w_{bi}, w_{bd}) \cdot |k - p| + \mathrm{dist}(b\text{'s}, e\text{'s}) \qquad (1)$$

where $\mathrm{dist}(u_i, v_j)$ is the distance between the node labels of $u_i$ and $v_j$ and the node
attributes of $u_i$ and $v_j$, calculated depending on their data types, and $k$ and $p$ are the
numbers of branches connected to the root nodes of the matched BARGs.

$\mathrm{dist}(b\text{'s}, e\text{'s})$ is calculated as the minimum of a weighted bipartite graph constructed
from the b's and e's as its nodes; its structure is as shown in Fig. 2.

The weight of any branch connecting two nodes, say $b_j$ and $e_h$, in the bipartite graph
is the distance between $b_j$ and $e_h$ and is calculated as follows:

$$\mathrm{distance}(b_j, e_h) = w_{bs} \cdot \mathrm{dist}(b_j, e_h) + w_{sp} \cdot \mathrm{dist}(u_j, v_h) \qquad (2)$$

where $\mathrm{dist}(x, y)$ is the distance between x and y and their attributes based on their
data type.

The minimum of the weighted bipartite graph can be calculated by many
algorithms [31-33]. It is known in the literature as the assignment problem and has
computational complexity $O(n^3)$ in the worst case and $O(n^2)$ in the average case [33].
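Equations (1) and (2) can be illustrated with the sketch below. Several choices are assumptions made only for the example: attributes are single numbers compared with an absolute difference, a BARG is a pair `(root_attribute, [(branch_attribute, leaf_attribute), ...])`, the weights are passed in a dict keyed by the subscripts of the text, and the assignment problem is solved by brute force, which is only practical for very small BARGs (the algorithms cited in [31-33] would be used in practice).

```python
from itertools import permutations


def min_assignment(cost):
    """Minimum total cost of matching every row (or column, whichever is
    fewer) of a small cost matrix; brute force over all injective choices."""
    rows = len(cost)
    cols = len(cost[0]) if rows else 0
    if rows > cols:                      # always enumerate over the smaller side
        cost = [list(col) for col in zip(*cost)]
        rows, cols = cols, rows
    if rows == 0:
        return 0
    return min(sum(cost[i][j] for i, j in enumerate(choice))
               for choice in permutations(range(cols), rows))


def barg_distance(u, v, w):
    """Sketch of Eq. (1), with Eq. (2) as the bipartite branch weights."""
    (u_root, u_branches), (v_root, v_branches) = u, v
    k, p = len(u_branches), len(v_branches)
    # Eq. (2): branch substitution plus structure-preservation term on the leaves.
    cost = [[w["bs"] * abs(b - e) + w["sp"] * abs(u_leaf - v_leaf)
             for (e, v_leaf) in v_branches]
            for (b, u_leaf) in u_branches]
    return (w["ns"] * abs(u_root - v_root)
            + min(w["ni"], w["nd"]) * abs(k - p)
            + min(w["bi"], w["bd"]) * abs(k - p)
            + min_assignment(cost))


# Hypothetical unit weights and two small BARGs.
weights = {"ns": 1, "ni": 1, "nd": 1, "bi": 1, "bd": 1, "bs": 1, "sp": 1}
U = (5, [(1, 2), (3, 4)])
V = (6, [(3, 4)])
```

Here `barg_distance(U, V, weights)` adds a root substitution cost of 1, a penalty of 2 for the one missing branch and leaf, and a zero-cost match between the identical branches.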
3- The Algorithm

The main idea of the proposed algorithm is to decompose both the reference and input
graphs to be matched, say $G_R$ and $G_I$ respectively, into BARGs as previously
described in section 2-1. A distance matrix D between the reference and input graphs is
then constructed from the distances between the BARGs of both
graphs. The labels of the rows in the distance matrix represent the
BARGs of the input graph, while the labels of the columns represent the BARGs of the
reference graph. $D_{ij}$ represents the distance between the i-th BARG in the input graph
and the j-th BARG in the reference graph and is calculated as described in section 2-2.

After calculating the distance matrix D, a weighted bipartite graph is constructed 
which is equivalent to the distance matrix D where each branch connecting two nodes, 
one represents a BARG from the input graph and the other node represents a BARG 
from the reference graph, has a weight equivalent to the distance between the two 
BARGs connected by this branch as shown in Fig. 3. 




Fig. 3. (a) The distance matrix, (b) corresponding weighted bipartite graph.

Distance (input_graph, reference_graph) = minimum_weighted_bipartite_graph
(distance matrix) + unmatched_branches + unmatched_nodes (3)

Every pair of BARGs (one BARG from the input graph and the other from the reference
graph) connected in the weighted bipartite graph by a branch whose weight is used in
calculating the minimum of the weighted bipartite graph is considered to be matched,
i.e., the root node of the BARG of the input graph matches the root node of the BARG of the
reference graph. Fig. 4-(a) shows the structure of the minimum weighted bipartite graph
while Fig. 4-(b) shows the position of unmatched branches and nodes.
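A sketch of Eq. (3) on a given distance matrix follows. It is illustrative only: the assignment is found by brute force, the input graph is assumed to have no more BARGs than the reference graph, each unmatched node is assumed to cost 1, and the number of unmatched branches is passed in by the caller because counting them requires the graph structure, which is not modelled here.

```python
from itertools import permutations


def match_bargs(D):
    """Minimum-cost assignment of each row (input BARG) to a distinct
    column (reference BARG). Assumes len(D) <= len(D[0])."""
    n_rows, n_cols = len(D), len(D[0])
    best_cost, best_cols = min(
        (sum(D[i][j] for i, j in enumerate(cols)), cols)
        for cols in permutations(range(n_cols), n_rows))
    return best_cost, list(enumerate(best_cols))


def graph_distance(D, unmatched_branches=0):
    """Eq. (3): assignment cost + unmatched branches + unmatched nodes."""
    cost, pairs = match_bargs(D)
    unmatched_nodes = len(D[0]) - len(pairs)   # reference BARGs left over
    return cost + unmatched_branches + unmatched_nodes, pairs


# Hypothetical 2 x 3 distance matrix (2 input BARGs, 3 reference BARGs).
D = [[1, 9, 4],
     [8, 2, 7]]
distance, matched_pairs = graph_distance(D)
```

The returned pairs play the role of the matched root nodes described above: pair `(i, j)` means that the root of input BARG `i` is matched to the root of reference BARG `j`.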





Fig. 4. (a) Min. weighted bipartite graph, (b)Unmatched branches and nodes 
3-1- Analysis of Computational Complexity

Suppose that we are given a reference graph R with M nodes and an input graph I
with N nodes. The first step in the algorithm is to decompose both R and I into
BARGs. This step is done in quadratic time for each graph, i.e., it has
time complexity of $O(M^2 + N^2)$.

The second step in the algorithm is to calculate the distance between all the BARGs
of the input and reference graphs. The computational complexity of matching two BARGs
mainly depends on the calculation of the error-correcting transformations between the two BARGs
and the computational complexity of getting the minimum of a weighted bipartite
graph. The computational complexity of calculating the error-correcting
transformations is $O(\delta)$, where $\delta$ denotes the maximum number of branches
connected to any node in the matched BARGs. The average computational complexity
of getting the minimum of a weighted bipartite graph is $O(M \cdot N)$ [48], assuming that
any node is connected to all other nodes in the graph, so we can consider the
computational complexity of matching two BARGs to be $O(M \cdot N)$. The calculation of the
distance between the BARGs is repeated for all the BARGs of both input and reference
graphs, which means that the average computational complexity of the second step is
$O(M^2 \cdot N^2)$, where $M \cdot N > \delta$. The best and worst cases for this step have
computational complexity of $O(M \cdot N \cdot \min(M, N))$ and $O(M^2 \cdot N^2 \cdot \min(M, N))$
respectively.

The third step is to calculate the cost of matching the input and reference graphs, which
is defined as the minimum of the weighted bipartite graph constructed in step 2;
the computational complexity of this step is $O(M \cdot N)$.

The last step is to count the number of unmatched branches and unmatched nodes. This
step has complexity of quadratic order.

In summary, the average computational complexity of the proposed algorithm in
calculating the cost of matching two ARGs with M and N nodes is $O(M^2 \cdot N^2)$. The
best and worst computational complexities are $O(M \cdot N \cdot \min(M, N))$ and
$O(M^2 \cdot N^2 \cdot \min(M, N))$ respectively. The computational complexity of the new
algorithm is much better than that of other algorithms reported in the literature [1, 7, 9, 10,
14, 25-30].

4- Experimental Results 

4-1- Distance between Graphs

We start by demonstrating the capability of the new algorithm in calculating the 
distance between attributed relational graphs. 

The problem is to identify an image graph of the prominent runways of the
Jacksonville airport from the picture shown in Fig. 5 [3]. The image graph has to be
matched with three runway models of the open-V runway configuration. Airports that
have open-V runways are Houston airport, Jacksonville airport, and Mid-continent
airport; they are shown in Fig. 6-(a), (b), and (c) respectively. A vertex
represents a runway and has an attribute that corresponds to the length of this runway.
Each edge has two attributes: the center-to-center distance between the connected vertices and
the acute angle between them [3].





Fig. 6. (a) Runway model A., (b) Runway model B., (c) Runway model C 



Table I shows the output of the proposed algorithm, which indicates that the image 
graph of the Jacksonville runways and runway model B incur the lowest cost of 
matching. The results of matching are consistent and identical with the results 
published in [3]. 







           Cost    Vertex Mapping
Model A    5660    (1,6),(2,5),(3,1),(4,2),(5,4),(6,3)
Model B    3847    (1,1),(2,2),(3,4),(4,5),(5,6),(6,7)
Model C    4430    (1,5),(2,4),(3,1),(4,2),(5,3),(6,8)

Table I. Results of subgraph optimal isomorphism.



4-2- Graph Isomorphism Problem

In this experiment, the performance of the proposed algorithm is tested in matching
sparsely connected attributed relational graphs at different noise levels. Attributed relational
graphs of 50 nodes are generated with degree of connectivity $\gamma$, where $\gamma \in \{10\%, 15\%,
20\%, 25\%\}$. Two nodes are connected by an edge with probability $\gamma$. Different noise
levels are added to edge and node attributes. Noise levels are in the set {0.00,
0.04, 0.08, 0.12, 0.16, 0.20}. After adding the noise, the graph is shuffled and matched
with the original one. One hundred trials are run for each connectivity and noise
level. Fig. 7 shows the results obtained from applying the proposed algorithm.
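The test data for such an experiment could be generated along the lines of the sketch below. The details are assumptions, since the text does not specify them: attributes are uniform random numbers in [0, 1), noise is drawn uniformly from [-noise, +noise], graphs are undirected, and all function and variable names are illustrative.

```python
import random


def random_arg(n, gamma, rng):
    """Random ARG: n node attributes, each node pair joined with probability gamma."""
    nodes = [rng.random() for _ in range(n)]
    edges = {(i, j): rng.random()
             for i in range(n) for j in range(i + 1, n)
             if rng.random() < gamma}
    return nodes, edges


def noisy_shuffled_copy(nodes, edges, noise, rng):
    """Add uniform noise to every attribute, then relabel the nodes at random.

    Returns the new graph and the ground-truth permutation, where
    perm[old_index] is the new index of that node."""
    n = len(nodes)
    perm = list(range(n))
    rng.shuffle(perm)
    new_nodes = [0.0] * n
    for old, attr in enumerate(nodes):
        new_nodes[perm[old]] = attr + rng.uniform(-noise, noise)
    new_edges = {}
    for (i, j), attr in edges.items():
        a, b = sorted((perm[i], perm[j]))
        new_edges[(a, b)] = attr + rng.uniform(-noise, noise)
    return new_nodes, new_edges, perm


rng = random.Random(0)                      # fixed seed for repeatability
nodes, edges = random_arg(50, 0.10, rng)
new_nodes, new_edges, perm = noisy_shuffled_copy(nodes, edges, 0.0, rng)
```

A trial then consists of matching the shuffled copy against the original graph and comparing the recovered vertex mapping with `perm`.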
Fig. 7. Results of experiment 2 for sparsely attributed relational graphs.



4-3- Subgraph Isomorphism Problem

In this section the performance of the proposed algorithm is tested in handling the 
problem of subgraph isomorphism. 

In this experiment we use a simulation method to test the performance of the new
algorithm in handling sparsely connected graphs. Attributed relational graphs of size
100 are generated with 10% degree of connectivity. The weights of edges are real
numbers produced randomly in the range 0 - 1. Nodes have five random
binary valued attributes. After the generation of the graphs, 60% or 80% of the nodes are
deleted and uniform noise is added to the edges. The noise levels are in the set
{0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20}. One hundred trials are
run at each noise level. The results of the proposed algorithm in comparison with the
results obtained from the graduated assignment algorithm [9] are shown in figures
8-9, where (a) is the proposed algorithm and (b) is the graduated assignment algorithm.
From figures 8-9, it can be concluded that the proposed algorithm has a lower
percentage of incorrect matches than the recent technique of graduated assignment [9].





Fig. 8. Results of matching ARGs with 5 binary attributes and 60% deleted.

Fig. 9. Results of matching ARGs with 5 binary attributes and 80% deleted.

4-4- Largest Common Subgraphs Problem

Finally, the new algorithm is tested in handling the problem of largest common
subgraphs. The proposed algorithm is modified so that some of the entries
in the results of the minimum weighted bipartite graph are excluded because their
values (which are equivalent to the dissimilarity distance between some BARGs) are
higher than a pre-defined threshold value $t_{iso}$. The threshold value is application
dependent and is determined by an optimization procedure or heuristically. The main
purpose of $t_{iso}$ is to ensure that only isomorphic nodes are included. Fig. 10 shows the
distribution of the BARGs constructing the minimum weighted bipartite graph.
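The thresholding step amounts to a filter over the matched pairs. In the sketch below the triple layout `(input_index, reference_index, distance)`, the function name and the sample values are hypothetical; the source only states that entries whose distance is higher than the threshold are excluded.

```python
def common_subgraph_pairs(matched, threshold):
    """Keep only matched BARG pairs whose distance does not exceed the threshold.

    matched   : list of (input_index, reference_index, distance) triples taken
                from the minimum weighted bipartite graph
    threshold : application-dependent cut-off
    """
    return [(i, j) for (i, j, dist) in matched if dist <= threshold]


# Hypothetical matching result: two close pairs and one poor match.
matched = [(0, 2, 0.10), (1, 0, 0.05), (2, 1, 3.70)]
kept = common_subgraph_pairs(matched, threshold=0.5)
```

The nodes that survive the filter are the candidates for the largest common subgraphs; choosing the threshold too high lets non-isomorphic nodes through, and choosing it too low discards valid matches.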



Fig. 10. Distribution of BARGs and expected position of t_iso (horizontal axis: distance
between BARGs; marked region: largest common subgraphs).



Fig. 11 and Fig. 12 show different scans from an autonomous robot [17]. The data
contained in each scan are represented by an attributed relational graph whose vertices
are boundaries classified into four categories: wall, pseudo-boundary, partial-wall, and
reference-boundary. The length of each boundary is added as a vertex attribute. The
edges in the graph represent the angles between the boundaries. Table II shows the
results obtained by the proposed algorithm on these data. The results are identical to
those in [17]. The algorithm of [17] was reported to be exponential.




Fig. 11. Scan 1 Fig. 12. Scan 2 
Table II. Results of Exp. 1.







274 Y. El-Sonbaty and M.A. Ismail 



5- Conclusions 

In this paper, a new algorithm for evaluating the distance between attributed
relational graphs is proposed. From experimental results and complexity analysis, the
following points can be concluded:

1- the new algorithm can be used efficiently for sparsely and fully connected attributed
relational graphs, and also for other types of graphs like attributed graphs and
weighted graphs,

2- the new algorithm has the capability of handling different isomorphism problems
like graph isomorphism, subgraph isomorphism and largest common subgraphs
with distinguished performance,

3- the computational complexity of this algorithm is much lower than other techniques 

found in the literature. The average computational complexity of the proposed 
algorithm is found to be 0(hfi*N^), 

4- the best and worst case for the computational complexity of the new algorithm is 
0(M*N*min(M,N)) and 0{M^*N^*min{M,N)) respectively, which is better than 
many techniques, 

5- the proposed algorithm is based on the concept of error-correcting transformations,
which gives the computed distance a more meaningful interpretation than other
techniques,

6- the proposed algorithm is parallel in nature and can take advantage of parallel
hardware architectures.
Abstract. Function-Described Graphs (FDGs) have been introduced as a representation
of an ensemble of Attributed Graphs (AGs) for structural pattern recognition and a
distance measure using restrictions between AGs and FDGs has been reported.
Nevertheless, in real applications, AGs can be distorted by some external noise, and
therefore some constraints have to be relaxed. To gain more flexibility and robustness,
some local costs may be added to the global cost of the labelling depending on the
fulfilment of the graph element constraints of the FDG instead of applying hard binary
constraints.



1 Introduction 

Function-Described Graphs (FDGs) were introduced in [2] and redefined in [3] as
a representation of an ensemble of Attributed Graphs (AGs) for structural pattern
recognition different from Random Graphs [5]. Some second-order relations (antagonism,
existence and occurrence of a pair of vertices or a pair of arcs) are introduced into the
FDGs to preserve, as far as possible, the structure of the ensemble of AGs. The synthesis
of FDGs was studied in [1]. Here, a new distance measure that relaxes the second-order
restrictions is presented.

Relations of second order defined on the FDGs are useful to constrain the set of
possible labellings while computing the distance with restrictions between AGs and
FDGs. This is aimed at reaching the best labelling function, taking into account the
second-order information obtained from the structure of the cluster of AGs that was
used to synthesise the FDG. Nevertheless, in real applications, AGs can be distorted
by some external noise, and therefore the constraints associated with the second-order
relations have to be relaxed so that a noisy AG is not misclassified because one of these
constraints is not fulfilled. A distance relaxing the second-order restrictions is
presented here. To gain more flexibility and robustness, some local non-negative costs
may be added to the global cost of the labelling depending on the second-order
probabilities of the graph elements, instead of applying hard binary constraints.

The organisation of this paper is as follows: AGs and FDGs are reviewed in sections 2
and 3, respectively. The new distance is proposed in section 4 and applied to the
3D-object recognition problem in section 5. Finally, some conclusions are sketched in
section 6.



2 Attributed Graphs 

Let $H = (\Sigma_v, \Sigma_e)$ be a directed graph structure of order $n$ where
$\Sigma_v = \{v_k \mid k = 1, \ldots, n\}$ is a set of vertices (or nodes) and
$\Sigma_e = \{e_{ij} \mid i, j \in \{1, \ldots, n\}, i \neq j\}$ is a set of edges (or
arcs). We use the term graph element to refer to either a vertex or an edge. Let
$\Delta_v$ and $\Delta_e$ be the global domains of possible values for non-null
attributed vertices and arcs, respectively. A null value of a graph element is
represented by $\Phi$.

An attributed graph $G$ over $(\Delta_v, \Delta_e)$ with an underlying graph structure
$H = (\Sigma_v, \Sigma_e)$ is defined to be a pair $(V, A)$ where
$V = (\Sigma_v, \gamma_v)$ is an attributed vertex set and $A = (\Sigma_e, \gamma_e)$ is
an attributed arc set. The mappings $\gamma_v : \Sigma_v \to \Delta_\omega$ and
$\gamma_e : \Sigma_e \to \Delta_\varepsilon$ assign attribute values to vertices and
arcs, respectively, where $\Delta_\omega = \Delta_v \cup \{\Phi\}$ and
$\Delta_\varepsilon = \Delta_e \cup \{\Phi\}$.

A complete AG is an AG with a complete graph structure $H$ (but possibly including
null elements). An attributed graph $G = (V, A)$ of order $n$ can be extended to form a
complete AG $G' = (V', A')$ of order $k$, $k \geq n$, by adding vertices and arcs with
null attribute values $\Phi$. We call $G'$ the k-extension of $G$.
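As an informal illustration of the k-extension, the sketch below stores an AG as a list of vertex attributes and a dictionary of arc attributes. The representation, the use of `None` for the null value $\Phi$ and the function name are assumptions made for this example only.

```python
PHI = None  # stands in for the null attribute value

def k_extension(vertex_attrs, arc_attrs, k):
    """Extend an AG of order n to a complete AG of order k >= n with null elements."""
    n = len(vertex_attrs)
    if k < n:
        raise ValueError("k must be at least the order of the graph")
    ext_vertices = list(vertex_attrs) + [PHI] * (k - n)
    # Complete directed structure: every ordered pair (i, j), i != j, gets an arc;
    # arcs absent from the original graph receive the null attribute.
    ext_arcs = {
        (i, j): arc_attrs.get((i, j), PHI)
        for i in range(k) for j in range(k) if i != j
    }
    return ext_vertices, ext_arcs

vertices, arcs = k_extension([2.0, 3.5], {(0, 1): 1.0}, k=3)
```

The extended graph has three vertices (one null) and six arcs, of which only the arc (0, 1) carries a non-null attribute.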



3 Function-Described Graphs 

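A function-described graph $F$ over $(\Delta_v, \Delta_e)$ with an underlying graph
structure $H = (\Sigma_\omega, \Sigma_\varepsilon)$ is defined to be a tuple
$(W, B, P, R)$ such that

1. $W = (\Sigma_\omega, \gamma_\omega)$ is a random vertex set and
$\gamma_\omega : \Sigma_\omega \to \Omega_\omega$ is a mapping that associates each
vertex $\omega_i \in \Sigma_\omega$ with a random variable
$\alpha_i = \gamma_\omega(\omega_i)$ with values in $\Delta_\omega$.

2. $B = (\Sigma_\varepsilon, \gamma_\varepsilon)$ is a random arc set and
$\gamma_\varepsilon : \Sigma_\varepsilon \to \Omega_\varepsilon$ is a mapping that
associates each arc $\varepsilon_{kl} \in \Sigma_\varepsilon$ with a random variable
$\beta_j = \gamma_\varepsilon(\varepsilon_{kl})$ with values in $\Delta_\varepsilon$.

3. $P = (P_\omega, P_\varepsilon)$ are two sets of marginal (or first-order) probability
density functions for random vertices and edges, respectively. That is,
$P_\omega = \{p_i(a), 1 \leq i \leq n\}$ and $P_\varepsilon = \{q_j(b), 1 \leq j \leq m\}$
($m$ being the number of edges), where $p_i(a) \equiv \Pr(\alpha_i = a)$ for all
$a \in \Delta_\omega$ and
$q_j(b) \equiv \Pr(\beta_j = b \mid \alpha_{j1} \neq \Phi \wedge \alpha_{j2} \neq \Phi)$
for all $b \in \Delta_\varepsilon$, such that $\alpha_{j1}, \alpha_{j2}$ refer to the
random variables for the endpoints of the random arc associated with $\beta_j$. By
definition, $\Pr(\beta_j = \Phi \mid \alpha_{j1} = \Phi \vee \alpha_{j2} = \Phi) = 1$.

4. $R = (A_\omega, A_\varepsilon, O_\omega, O_\varepsilon, E_\omega, E_\varepsilon)$ is a
collection of boolean functions defined over pairs of graph elements (i.e. relations on
the sets of vertices and arcs) that allow the incorporation of qualitative second-order
probability information. $A_\omega$ and $A_\varepsilon$ are the vertex antagonism and
arc antagonism functions, respectively, where
$A_\omega : \Sigma_\omega \times \Sigma_\omega \to \{0, 1\}$ is defined by
$A_\omega(\omega_i, \omega_j) = 1 \Leftrightarrow \Pr(\alpha_i \neq \Phi \wedge \alpha_j \neq \Phi) = 0$,
and similarly, $A_\varepsilon : \Sigma_\varepsilon \times \Sigma_\varepsilon \to \{0, 1\}$
is defined by
$A_\varepsilon(\varepsilon_{kl}, \varepsilon_{pq}) = 1 \Leftrightarrow \Pr(\beta_i \neq \Phi \wedge \beta_j \neq \Phi) = 0$,
where $\beta_i = \gamma_\varepsilon(\varepsilon_{kl})$ and
$\beta_j = \gamma_\varepsilon(\varepsilon_{pq})$. In addition, $O_\omega$ and
$O_\varepsilon$ are the vertex occurrence and arc occurrence functions, where
$O_\omega : \Sigma_\omega \times \Sigma_\omega \to \{0, 1\}$ is defined by
$O_\omega(\omega_i, \omega_j) = 1 \Leftrightarrow \Pr(\alpha_i \neq \Phi \wedge \alpha_j = \Phi) = 0$,
and $O_\varepsilon : \Sigma_\varepsilon \times \Sigma_\varepsilon \to \{0, 1\}$ is
defined by
$O_\varepsilon(\varepsilon_{kl}, \varepsilon_{pq}) = 1 \Leftrightarrow \Pr(\beta_i \neq \Phi \wedge \beta_j = \Phi) = 0$.
We say that two graph elements (of the same type) are co-occurrent if and only if the
occurrence relation applies to them in both directions. Finally, $E_\omega$ and
$E_\varepsilon$ are the vertex existence and arc existence functions, where
$E_\omega : \Sigma_\omega \times \Sigma_\omega \to \{0, 1\}$ is defined by
$E_\omega(\omega_i, \omega_j) = 1 \Leftrightarrow \Pr(\alpha_i = \Phi \wedge \alpha_j = \Phi) = 0$,
and $E_\varepsilon : \Sigma_\varepsilon \times \Sigma_\varepsilon \to \{0, 1\}$ is
defined by
$E_\varepsilon(\varepsilon_{kl}, \varepsilon_{pq}) = 1 \Leftrightarrow \Pr(\beta_i = \Phi \wedge \beta_j = \Phi) = 0$.

A random element $\delta$ of an FDG is a null random element if its probability of
instantiation to the null value is one, $\Pr(\delta = \Phi) = 1$. A complete FDG is an
FDG with a complete graph structure $H$. An FDG $F = (W, B, P, R)$ of order $n$ can be
extended to form a complete FDG $F' = (W', B', P', R')$ of order $k$, $k \geq n$, by
adding null vertices and null arcs and extending appropriately both the set of
probability density functions and the boolean functions that relate graph elements. We
call $F'$ the k-extension of $F$.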
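To make the three vertex relations concrete, the following sketch derives them from a set of example AGs, each reduced to the set of FDG vertices that are non-null in it. Estimating the joint probabilities by simple frequency counts over these examples is an assumption of this illustration, as are all names.

```python
def second_order_relations(views, vertices):
    """Boolean antagonism, occurrence and existence relations between vertex pairs.

    views: list of sets; each set holds the vertices that are non-null in one AG.
    A relation is 1 exactly when the corresponding joint event never happens.
    """
    def never(event):
        return int(not any(event(view) for view in views))

    antagonism, occurrence, existence = {}, {}, {}
    for i in vertices:
        for j in vertices:
            if i == j:
                continue
            # Antagonism: i and j are never present together.
            antagonism[i, j] = never(lambda v: i in v and j in v)
            # Occurrence: i is never present while j is absent.
            occurrence[i, j] = never(lambda v: i in v and j not in v)
            # Existence: i and j are never both absent.
            existence[i, j] = never(lambda v: i not in v and j not in v)
    return antagonism, occurrence, existence

views = [{0, 1}, {0, 1, 2}, {3}]
A, O, E = second_order_relations(views, vertices=range(4))
```

With these three views, vertices 2 and 3 are antagonistic, vertex 2 implies vertex 0 (occurrence), and at least one of vertices 0 and 3 is always present (existence).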



4 Distance between AGs and FDGs Using 1st and 2nd Order Costs



We require a fine but robust matching cost that makes powerful use of the measurement
information in the data graphs (attribute values) and in the prototypes (random
variable distributions), as well as an effective way of constraining the possible
matches, if we want the system to be able to discern between prototypes. The matching
measure must be soft for two reasons: first, because it is assumed that in real
applications the patterns are distorted by noise, and second, because a prototype has to
represent not only the objects in the reference set but also the ones that are "near"
them.

First of all, and for the sake of robustness, the mapping $h$ is not defined from the
initial AG that represents the pattern to the initial FDG that represents the class, but
from the k-extended AG to the k-extended FDG, to allow for missing graph elements or
extraneous graph elements introduced by noise. A missing element in the AG will be
represented by a null element in the extended AG, and an extraneous element in the AG
should be mapped to a null element in the extended FDG. Since it is desired to allow a
priori all the isomorphisms, the number of vertices $k$ in the extended graphs is set to
the sum of the number of vertices in both initial graphs. Hence, the limit situations in
which all the graph elements in the FDG are missing in the AG or all the graph elements
in the AG are extraneous are covered.

Let $G'$ be a k-extension of the AG $G$ and $F'$ be a k-extension of the FDG $F$.
Then, $G'$ and $F'$ are structurally isomorphic and complete with the same number of
vertices $k$, and they also share a common attribute domain $(\Delta_v, \Delta_e)$. Now,
the labelling function is defined as a mapping $h : G' \to F'$. Since graphs do not have
any predetermined orientation and each orientation is given by a morphism $h$, a global
cost $C_h$ is associated with each $h$ in a set of valid mappings $H$, and the measure of
dissimilarity is defined as the minimum of all such costs,



$$d = \min_{h \in H} \{ C_h \} \qquad (1)$$

In addition, an optimal labelling is given by

$$h_d = \arg\min_{h \in H} \{ C_h \} \qquad (2)$$



The set of valid mappings H contains all the bijective functions that are coherent 
structurally (i.e. the arc labelling is totally determined by the vertex labelling). 

We want the global cost $C_h$ to provide a quantitative idea of the match quality
through the mapping $h$ based on the joint conditional probability that the AG is
generated from the FDG given labelling $h$, that is, $C_h = func(P(G \mid h))$ as
presented in [3]. For instance, $C_h = -\ln(P(G \mid h))$ would be a possible choice,
but it is not the most adequate because of its high sensitivity to noise: if only one of
the probabilities were zero, the obtained distance would be $\infty$. Note that the
joint probability $P(G \mid h)$ cannot be estimated directly and has to be approximated
by the product of the first-order probabilities of the elements. In this case, the
previous choice is equivalent to

$$C_h = -\sum_{\forall x} \ln\big(\Pr(\gamma(y) = \gamma(x) \mid h(x) = y)\big) \qquad (3)$$

where $x$ and $y$ are graph elements in the AG and the FDG respectively, $\gamma(y)$ is
the random variable associated with $y$, $\gamma(x)$ is the attribute value in $x$, and
all the elements of both graphs have to appear in the product (possibly by extending
the domain and range of the mapping with null elements).

However, if only one graph element had a probability of zero, the joint probability
would be zero and $C_h$ would be infinite. Since this may happen due to the noisy
presence of an unexpected element (insertion) or the absence of a prototype element
(deletion), a single graph element that is not properly mapped due to clutter would
cause the involved graphs to be wrongly considered completely different.

Hence, it is better to decompose the global cost $C_h$ into the sum of bounded
individual costs associated with the element matches. Although this has the major flaw
that the joint probability is not considered as a whole, it has the advantage that
clutter affects the global cost only locally. An individual cost $C(x, y)$ represents
the dissimilarity between two mapped elements $x$ and $y$, and it could still be based
on the first-order probabilities of the elements,
$C(x, y) = func(\Pr(\gamma(y) = \gamma(x) \mid h(x) = y))$, as long as it is bounded by
some fixed constant, $C(x, y) \leq Max$, for instance $C(x, y) \leq 1$.

The global cost is therefore computed as the sum of the individual costs of all the
matches between graph elements,

$$C_h = \sum_{\forall x} C(x, h(x)) \qquad (4)$$

The main concepts underlying the definition of the distance measures between AGs
and FDGs have been introduced above. To define the different specific measures, it
only remains to define the set of valid mappings $H$ and the individual costs
$C(x, y)$.
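Equations (1), (2) and (4) can be read together as a search over structurally coherent bijections. The sketch below enumerates all vertex permutations of two k-extended graphs and lets the vertex labelling determine the arc labelling. It is a naive illustration with exponential cost, not the branch-and-bound procedure used in the experiments; the cost callables and all names are hypothetical.

```python
from itertools import permutations

def brute_force_distance(k, vertex_cost, arc_cost):
    """Return (d, h_d): the minimum global cost and a labelling that attains it.

    vertex_cost(i, p): bounded cost of mapping AG vertex i to FDG vertex p.
    arc_cost((i, j), (p, q)): bounded cost of mapping AG arc (i, j) to FDG arc (p, q).
    """
    best_cost, best_h = float("inf"), None
    for h in permutations(range(k)):
        # Sum of individual costs over all vertices, as in equation (4).
        cost = sum(vertex_cost(i, h[i]) for i in range(k))
        # The arc labelling is fully determined by the vertex labelling.
        cost += sum(arc_cost((i, j), (h[i], h[j]))
                    for i in range(k) for j in range(k) if i != j)
        if cost < best_cost:
            best_cost, best_h = cost, h
    return best_cost, best_h

# Toy costs: vertex i matches FDG vertex (i + 1) mod 3 perfectly, arcs are free.
d, h_d = brute_force_distance(
    3,
    vertex_cost=lambda i, p: 0.0 if p == (i + 1) % 3 else 1.0,
    arc_cost=lambda arc, fdg_arc: 0.0,
)
```

Because every individual cost is bounded, a single badly matched element raises the total by at most one unit instead of making the whole distance infinite.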



Individual Costs of Matching Elements 

We now turn our attention to the individual cost of matching a pair of elements, one
from an AG and one from an FDG. It is defined as a normalised function depending on the
dissimilarity between the two mapped elements, as given by the negative logarithm of
the probability of instantiating the random element of the FDG to the corresponding
attribute value in the AG, that is

$$C(x, y) = \begin{cases} \dfrac{-\ln\big(\Pr(\gamma(y) = \gamma(x) \mid h(x) = y)\big)}{-\ln(K_{PR})} & \text{if } \Pr(\gamma(y) = \gamma(x) \mid h(x) = y) \geq K_{PR} \\ 1 & \text{otherwise} \end{cases} \qquad (5)$$

where the cost $C(x, y)$ is bounded by $[0, 1]$, and the positive constant
$K_{PR} \in [0, 1]$ is a threshold on low probabilities that is introduced to avoid the
case $\ln(0)$, which gives negative infinity. Hence, $C(x, y) = 1$ will be the cost of
matching a null element of the FDG to a non-null element of the AG or matching an FDG
element to an AG element whose attribute value has a very low probability of
instantiation, that is $\Pr(\gamma(y) = \gamma(x) \mid h(x) = y) \leq K_{PR}$.
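A minimal sketch of this normalised cost follows. The function name and the default threshold value are illustrative assumptions; the threshold is required here to lie strictly between 0 and 1 so that the normalising logarithm is finite and non-zero.

```python
import math

def individual_cost(prob, k_pr=0.1):
    """Normalised first-order cost in [0, 1] for a match of probability `prob`."""
    if not 0.0 < k_pr < 1.0:
        raise ValueError("k_pr must lie strictly between 0 and 1")
    if prob >= k_pr:
        # ln(prob) / ln(k_pr) equals (-ln prob) / (-ln k_pr): 0 at prob = 1, 1 at prob = k_pr.
        return max(0.0, math.log(prob) / math.log(k_pr))
    return 1.0  # low-probability or impossible matches get the maximum cost

costs = [individual_cost(p) for p in (1.0, 0.5, 0.1, 0.0)]
```

A certain match costs 0, a match at the threshold costs 1, and anything below the threshold (including probability zero) is clipped to 1.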

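In the case of the vertices, the individual cost is defined using the probabilities
stored in the FDG as

$$C_{f_v}(v_i, \omega_q) = \begin{cases} \dfrac{-\ln(p_q(a_i))}{-\ln(K_{PR})} & \text{if } p_q(a_i) \geq K_{PR} \\ 1 & \text{otherwise} \end{cases} \qquad (6)$$

where $a_i = \gamma_v(v_i)$. In the case of the arcs, the individual cost is defined
using the arc conditional probabilities as follows. Let $\gamma_e(e_{ij}) = b_m$ in the
AG arc and let $\gamma_\varepsilon(\varepsilon_{ab}) = \beta_n$ in the matched FDG arc.
Then, in general,

$$C_{f_e}(e_{ij}, \varepsilon_{ab}) = \begin{cases} \dfrac{-\ln(q_n(b_m))}{-\ln(K_{PR})} & \text{if } q_n(b_m) \geq K_{PR} \\ 1 & \text{otherwise} \end{cases} \qquad (7)$$

However, if either $v_i$ or $v_j$ is a null extended vertex in the AG, then the
conditional probability $q_n(b_m)$ is not applicable, since it depends on the existence
of the two extreme vertices, and must be replaced by the conditional probability
$\Pr(\beta_n = b_m \mid \alpha_a = \Phi \vee \alpha_b = \Phi)$, which is 1 if
$b_m = \Phi$ and 0 otherwise.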



Second Order Costs of Matching Elements 

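The second order costs could be defined for the vertices as shown in equations (8)
to (10), where it is assumed that $h(v_i) = \omega_p$ and $h(v_j) = \omega_q$. These
equations cover respectively the three following qualitative cases: presence of two
vertices in the AG, presence of only one of them, and absence of both vertices. Note
that the second-order costs induced artificially by FDG null vertices are not taken
into account.

$$C_{A_\omega}(v_i, v_j, \omega_p, \omega_q) = \begin{cases} 1 - \Pr(\alpha_p \neq \Phi \wedge \alpha_q \neq \Phi) & \text{if } a_i \neq \Phi \wedge a_j \neq \Phi \wedge p_p(\Phi) \neq 1 \wedge p_q(\Phi) \neq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

$$C_{O_\omega}(v_i, v_j, \omega_p, \omega_q) = \begin{cases} 1 - \Pr(\alpha_p \neq \Phi \wedge \alpha_q = \Phi) & \text{if } a_i \neq \Phi \wedge a_j = \Phi \wedge p_p(\Phi) \neq 1 \wedge p_q(\Phi) \neq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

$$C_{E_\omega}(v_i, v_j, \omega_p, \omega_q) = \begin{cases} 1 - \Pr(\alpha_p = \Phi \wedge \alpha_q = \Phi) & \text{if } a_i = \Phi \wedge a_j = \Phi \wedge p_p(\Phi) \neq 1 \wedge p_q(\Phi) \neq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

The definitions of the costs on the arcs,
$C_{A_\varepsilon}(e_{ij}, e_{kt}, \varepsilon_{ab}, \varepsilon_{cd})$,
$C_{O_\varepsilon}(e_{ij}, e_{kt}, \varepsilon_{ab}, \varepsilon_{cd})$ and
$C_{E_\varepsilon}(e_{ij}, e_{kt}, \varepsilon_{ab}, \varepsilon_{cd})$, are similar to
those of the costs on the vertices (see [3] for more details).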

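Since the second-order probabilities are not actually stored in the FDGs, they are
replaced by the second-order relations, thus obtaining costs that are coarser. That is,
some second-order non-negative costs are added to the global cost of the labelling
when second-order constraints (antagonism, occurrence, existence) are broken.
Equations (11) to (13) show the final second-order costs, which can be only 1 or 0,
associated with the three relations of antagonism, occurrence and existence between
pairs of vertices.

$$C_{A_\omega}(v_i, v_j, \omega_p, \omega_q) = \begin{cases} A_\omega(\omega_p, \omega_q) & \text{if } a_i \neq \Phi \wedge a_j \neq \Phi \wedge p_p(\Phi) \neq 1 \wedge p_q(\Phi) \neq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

$$C_{O_\omega}(v_i, v_j, \omega_p, \omega_q) = \begin{cases} O_\omega(\omega_p, \omega_q) & \text{if } a_i \neq \Phi \wedge a_j = \Phi \wedge p_p(\Phi) \neq 1 \wedge p_q(\Phi) \neq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$

$$C_{E_\omega}(v_i, v_j, \omega_p, \omega_q) = \begin{cases} E_\omega(\omega_p, \omega_q) & \text{if } a_i = \Phi \wedge a_j = \Phi \wedge p_p(\Phi) \neq 1 \wedge p_q(\Phi) \neq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$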
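The binary second-order vertex costs can be sketched as one small function. It assumes that the relation values for the pair of matched FDG vertices and their null probabilities are supplied by the caller, and it uses `None` for the null attribute; the names are hypothetical.

```python
PHI = None  # stands in for the null attribute value

def second_order_vertex_costs(a_i, a_j, p_null_p, p_null_q,
                              antagonism, occurrence, existence):
    """Return the binary costs (C_A, C_O, C_E) for one ordered pair of AG vertices.

    a_i, a_j: attribute values of the two AG vertices (PHI if null).
    p_null_p, p_null_q: probability of the null value for the matched FDG vertices.
    antagonism, occurrence, existence: 0/1 relation values between those FDG vertices.
    """
    # Costs induced artificially by null FDG vertices are not taken into account.
    if p_null_p == 1 or p_null_q == 1:
        return 0, 0, 0
    present_i, present_j = a_i is not PHI, a_j is not PHI
    cost_a = antagonism if (present_i and present_j) else 0
    cost_o = occurrence if (present_i and not present_j) else 0
    cost_e = existence if (not present_i and not present_j) else 0
    return cost_a, cost_o, cost_e

# Both AG vertices are present although their FDG vertices are antagonistic.
costs = second_order_vertex_costs(4.0, 2.5, 0.3, 0.6,
                                  antagonism=1, occurrence=0, existence=0)
```

For any given pair at most one of the three costs can be non-zero, because the three presence patterns are mutually exclusive.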



Global Cost 

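The global cost $C_h$ on the labelling function $h$ is defined with two terms that
depend on the first-order probability information, and six more terms that depend on
the second-order constraints:

$$
\begin{aligned}
C_h ={}& K_1 \sum_{\forall v_i \in \Sigma_v \text{ of } G'} C_{f_v}(v_i, h(v_i))
 + K_2 \sum_{\forall e_{ij} \in \Sigma_e \text{ of } G'} C_{f_e}(e_{ij}, h(e_{ij})) \\
&+ K_3 \sum_{\forall v_i, v_j \in \Sigma_v \text{ of } G'} C_{A_\omega}(v_i, v_j, h(v_i), h(v_j))
 + K_4 \sum_{\forall e_{ij}, e_{kt} \in \Sigma_e \text{ of } G'} C_{A_\varepsilon}(e_{ij}, e_{kt}, h(e_{ij}), h(e_{kt})) \\
&+ K_5 \sum_{\forall v_i, v_j \in \Sigma_v \text{ of } G'} C_{O_\omega}(v_i, v_j, h(v_i), h(v_j))
 + K_6 \sum_{\forall e_{ij}, e_{kt} \in \Sigma_e \text{ of } G'} C_{O_\varepsilon}(e_{ij}, e_{kt}, h(e_{ij}), h(e_{kt})) \\
&+ K_7 \sum_{\forall v_i, v_j \in \Sigma_v \text{ of } G'} C_{E_\omega}(v_i, v_j, h(v_i), h(v_j))
 + K_8 \sum_{\forall e_{ij}, e_{kt} \in \Sigma_e \text{ of } G'} C_{E_\varepsilon}(e_{ij}, e_{kt}, h(e_{ij}), h(e_{kt}))
\end{aligned}
\qquad (14)
$$

The eight terms are weighted by non-negative constants $K_1$ to $K_8$, to compensate
for the different number of elements in the additions as well as to balance the
influence of second-order costs with respect to first-order costs in the overall value.
Note that if $K_i = \infty$, $i = 3..8$, there are strict constraints associated with
the second order relations and so the distance with 2nd order restrictions is obtained.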
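The weighted combination, including the limiting case of infinite weights, might be sketched as follows. The function name and the convention of skipping zero-valued terms are assumptions of this example; skipping them avoids the undefined product of an infinite weight with a zero cost in floating-point arithmetic.

```python
import math

def global_cost(term_sums, weights):
    """Weighted sum of the eight cost terms; an infinite weight acts as a hard constraint."""
    total = 0.0
    for term, weight in zip(term_sums, weights):
        if term == 0:
            continue  # a satisfied constraint contributes nothing, even if its weight is infinite
        total += weight * term
    return total

# Order of the entries: K1 .. K8 as in the global cost definition.
strict = [1.0, 0.5, math.inf, 0.0, math.inf, 0.0, 0.0, 0.0]
feasible = global_cost([1.5, 2.0, 0, 0, 0, 0, 0, 0], strict)
violated = global_cost([1.5, 2.0, 1, 0, 0, 0, 0, 0], strict)
```

With strict weights a labelling that breaks one antagonism constraint gets an infinite cost and is effectively excluded, whereas finite weights only penalise it.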



5 Results 

The contribution of FDGs to structural pattern recognition is illustrated by the
three-dimensional object recognition problem. The original data are composed of 101
AGs, which represent the semantic and structural information of the views taken from
five objects. Figure 1 shows a selected view of each object.




Figure 1. 



Vertices in the AGs represent the faces, with one attribute, which is the area of the
face. Arcs represent the edges between faces, with one attribute, which is the length of
the edge. Five FDGs were built from the AGs that represent their views using a
supervised synthesis method. An antagonism relation between two graph elements appears
when these elements have never been seen together in the same view. On the other hand,
an occurrence relation appears when a graph element is visible in all the views in
which another one is visible too. There is no existence relation because there is no
pair of faces such that at least one of the two faces is visible in all views. See [3]
for more details. The object of the tests presented here is to assess the effects of the
application of the antagonism and occurrence relations between vertices in the
computation of our distance. To that aim, some tests have been carried out with
different weights on these relations ($K_3$ and $K_5$). The other weights have been set
as follows: $K_1 = 1$ (vertices), $K_2 = 1/2$ (arcs), $K_7 = 0$ (existence on the
vertices), $K_4 = K_6 = K_8 = 0$ (second order relations on the arcs). The distance
(optimal cost) presented in Section 4 was computed by means of a branch-and-bound
algorithm [3].

The test set was composed of random AGs, which were the previous AGs modified by some
structural or semantic noise. The results shown are the average classification
correctness over tests performed 20 times. The semantic noise, which is added to the
attribute values of the vertices and arcs, is obtained by a random number generator
with a median of 0.0 and a standard deviation of 0.0, 2.0, 4.0, 8.0 or 12.0. The
structural noise, also obtained by a random number generator, deletes or includes 0, 1
or 2 vertices, which represent 0%, 20% and 40% of the average structure, respectively.
The experimental results are summarised in Table 1.



# noise vertices         0      0      0      0      0      1      2      1
Standard deviation       0.0    2.0    4.0    8.0    12.0   0.0    0.0    8.0

K_3 = 0, K_5 = 0         100    90.1   89.7   88.6   86.3   70.8   67.7   68.7
K_3 = ∞, K_5 = 0         100    92.5   89.3   87.0   84.9   61.6   54.4   57.4
K_3 = 0, K_5 = ∞         100    91.9   89.9   88.2   85.2   62.5   59.5   59.5
K_3 = ∞, K_5 = ∞         100    95.1   90.2   86.6   80.8   60.7   53.2   56.2
K_3 = 1, K_5 = 0         100    92.3   91.5   91.3   87.2   80.5   75.3   75.5
K_3 = 0, K_5 = 1         100    95.6   92.4   91.5   86.4   81.2   77.2   76.4
K_3 = 1, K_5 = 1         100    98.7   97.1   95.0   92.5   89.2   85.2   83.6
Nearest neighbour        100    98.9   82.6   62.6   52.4   90.0   58.6   58.6

Table 1. Recognition ratio (%) obtained by the FDGs and by the nearest-neighbour
classifier (using Sanfeliu's distance between AGs [4]) when applying different levels
of noise.

The classification correctness is higher when applying strict relations ($K_3 = \infty$
and $K_5 = \infty$) than without applying these two relations ($K_3 = 0$ and
$K_5 = 0$) when the semantic or structural noise is low, but, when the noise increases,
the best results appear when no relations are applied. In contrast, the distance with
both second-order costs ($K_3 = 1$ and $K_5 = 1$) always obtains higher results than if
one of the relations is not taken into account. The FDG classifier only obtains worse
results than the nearest-neighbour classifier when the structural or semantic noise is
very low.
6 Conclusions

A new distance between AGs and FDGs has been reported. The aim of this distance is
to gain more flexibility and robustness by relaxing the second order constraints. Some
local costs have been added to the global cost depending on the second order
probabilities instead of applying hard binary constraints. Results show that the
distance with 2nd order costs obtains better results than the distance with strict 2nd
order restrictions or without considering them.

The main problem of computing a distance between graphs associated with an op- 
timal match is the exponential cost. While a branch-and-bound algorithm [3] was used 
in the reported experiments, a more efficient but sub-optimal method has been pre- 
sented recently [6] . 
Abstract. Many authors have already proposed linear feature extraction algorithms.
In most cases, these algorithms cannot guarantee the extraction of adjacency relations
between extracted features. Object contours appearing in the analyzed images are often
fragmented into non-connected features. Nevertheless, the use of some topological
information makes it possible to reduce substantially the complexity of matching and
registration algorithms.

Here, we formulate the problem of linear feature extraction as an optimal labelling
problem of a topological map obtained from low level operations. The originality of our
approach lies in maintaining this data structure during the extraction process and in
formulating the problem of feature extraction as a global optimization problem.

Keywords: contour map, feature extraction, model selection, MDL



1 Introduction 

A structural description of an image, relevant for example in an architectural
context, is an issue of computer vision. Many algorithms use structural information.
Examples of such algorithms are registration algorithms in databases of images or
three-dimensional models, or calibration algorithms.

Elementary structural elements can be classified into three categories, according to
their dimension, i.e. interest points, linear features and regions. More complex
features, such as line pencils (vanishing points), collinear segments, etc., may be
pertinent in specific applications.

Although the different kinds of elementary features are complementary, they are
usually extracted independently from each other. For example, some authors proposed
techniques of interest point detection, such as corners and multiple junctions. Other
authors proposed linear feature extraction techniques from independent point sets,
chained points, or thick point sets. Segmentation algorithms, which partition an image
into regions of homogeneous properties, are numerous. The problem of consistency
between the different types of features is faced when one tries to organize those
features into a global adjacency structure.



F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 2S7- IWII 2000. 
@ Springer- Verlag Berlin Heidelberg 2000 
The perceptual grouping paradigm is widely used to solve this problem. Techniques
using perceptual organization are resistant to occlusions, but are very sensitive to
noise and to parameter values. More recently, Fuchs et al. proposed a multi-primitive
extraction system where conflicts between features extracted independently are
detected, thanks to a structure that stores adjacency relations between the different
features.

Other coherent approaches are used in line drawing vectorization. The main point of
interest of those approaches consists in using a skeleton of a binary image obtained
from the initial bilevel image. This skeleton is viewed as a graph of discrete points,
which guarantees the consistency of the adjacency relations between the extracted line
segments. Junction points and faces can be naturally deduced from the graph structure.
Line segments are extracted with respect to local criteria, but the extension of these
methods to other curves seems problematic.

In this paper, we propose a new method for linear feature extraction that enables the
retrieval of topological relations between the extracted features. It can be applied to
bilevel images as well as to multilevel images. The first step of the extraction
process is the construction of a low level topological structure called contour map
(section 2). The problem of linear feature extraction is then viewed as an optimal
labelling problem, which is derived in the second part of this article (section 3), and
a simple sub-optimal algorithm is presented. We then propose an example illustrating
the usefulness of the proposed contribution.



2 Contour Map 

2.1 Definition 

A contour map is a topological map embedded in the discrete plane: its vertices are
points of the discrete plane, its Jordan arcs are 4-connected curves, and its faces are
8-connected sets of points.

A topological map is a data structure representing a cellular decomposition of a
plane, that codes the faces and the edges of that decomposition, and that gives
efficient access to the adjacency relations between the different kinds of elements.
Such a structure has already been used for the editing of two-dimensional drawings, and
for representing images with respect to an inter-pixel topology.



2.2 Construction 

A contour map can be obtained with a straightforward algorithm first introduced by
M. Pierrot Deseilligny et al. We suppose that the crests of an image $I$ represent the
contours of an original image. An image $I$ is defined as a function with a discrete
support (pixels $(x, y)$) such that all its values are different. This condition can be
fulfilled by defining $I$ as a concatenation of functions. The image of the figure 2.c
is obtained using the following tabulated function:

$$I(x, y) = [\, |Grad(x, y)|,\ D_{-}(x, y),\ x,\ y \,] \qquad (1)$$

where $|Grad(x, y)|$ is the norm of the gradient of the initial image, and
$D_{-}(x, y)$ is the shortest distance from a point $(x, y)$ to a point $(x', y')$ such
that $|Grad(x, y)| > |Grad(x', y')|$.

The contour map is built with a simple algorithm based on a local analysis of the
8-neighborhood of each pixel $p$ of $I$. Each neighborhood is decomposed into sets of
4-connected components whose values are greater than the value of the central pixel. We
construct the contour map by adding an edge that connects the central pixel with the
highest-valued pixel of each component.
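The local construction rule can be sketched in a few lines. The sketch relies on the observation that, within the ring of 8 neighbours listed in cyclic order, two neighbours are 4-adjacent exactly when they are consecutive on the ring, so the 4-connected components of higher-valued neighbours are maximal cyclic runs. Treating out-of-image neighbours as not higher and all function names are assumptions of this illustration; pixel values may be numbers or tuples compared lexicographically.

```python
# The 8 neighbours in cyclic order; consecutive entries are 4-adjacent to each other.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def cyclic_runs(flags):
    """Maximal runs of consecutive True entries on a cyclic list, as index lists."""
    n = len(flags)
    if all(flags):
        return [list(range(n))]
    start = flags.index(False)  # begin right after a False so no run is split
    runs, current = [], []
    for step in range(1, n + 1):
        k = (start + step) % n
        if flags[k]:
            current.append(k)
        elif current:
            runs.append(current)
            current = []
    return runs

def contour_edges(image):
    """Edges from each pixel to the highest pixel of each higher 4-connected component."""
    rows, cols = len(image), len(image[0])
    edges = set()
    for r in range(rows):
        for c in range(cols):
            higher = []
            for dr, dc in RING:
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                higher.append(inside and image[rr][cc] > image[r][c])
            for run in cyclic_runs(higher):
                best = max(run, key=lambda k: image[r + RING[k][0]][c + RING[k][1]])
                edges.add(((r, c), (r + RING[best][0], c + RING[best][1])))
    return edges

edges = contour_edges([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])
```

In this toy image the centre pixel has a single component of higher neighbours and is linked to its highest member, while the global maximum has no outgoing edge.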

Pierrot Deseilligny et al. demonstrated that a bijection exists between the faces of
the topological map and the local minima of $I$. A homotopic transformation is then
applied, which suppresses the pending darts of the previous combinatorial map until
stability. This operation conserves the number of faces and of connected components of
the map. The resulting map is called an elementary contour map. The edges of this map
bind two 4-connected pixels.

On a real image, we obtain an over-segmentation of the initial image, as shown in the
figure. This is due to noise, textures, etc. The relevant curves are noisy, but well
localized. The contour map can be simplified by a pre-treatment such as a selective
smoothing, or by applying a segmentation algorithm. This kind of operation does not
necessarily conserve the curves of interest.



3 Features Extraction and Labelling Problem 

Feature extraction is viewed here as the segmentation of the contour map into
4-connected discrete curves, according to different models of curves. This guarantees
that the connectivity of the map is preserved during the segmentation process and that
no adjacency information is lost.

3.1 Model Selection 

How can we choose the model which best fits a given sample of measured data among a
set of possible models? To formalize it, let $x = \bar{x} + e$ be a vector of measures,
where $e$ is the error vector and $\bar{x}$ is the vector of the exact values of the
measures. A statistical model is known when we know the law of $e$. A model $m$ of a
feature (for example, a linear feature model) is a set of relations which can be
represented by a function $f$ that verifies $f(\bar{x}, p) = 0$ where $p$ is a vector of
parameters describing this feature.

Let M = {mi,i = l..A^} be the set of all possible models and fi their 
associated functions. We are then seeking the feature of model m of M that is 
the most adapted to the data sample x. Let s{x,mi,pi) be a criterion of model 
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Fig. 1. Construction of a contour map. The upper left image is the initial image. 
The upper right image represents the modulus of the gradient of the previous one. The 
bottom left image shows the resulting contour map after the suppression of the pending 
arcs. Each black point of the image corresponds to an edge. The bottom right image 
was obtained by thresholding the locally maximal curves to suppress over-segmentation 
problems 



selection, which is a function that measures the appropriateness of the feature 
of model and of parameter pi to the data x. For a given model nii and a 
given data sample x, the minimization of this function gives an estimate pi of the 
parameters of the most appropriate feature of m^. The best model to according 
to the data x is the model realizing the minimum of s on M . 

iTi = mi/s{x,rrii,pi) = min s{x,mj,pj) (2) 

3 

Note that the estimates pi of the different features are required to find the 
model TO. An example of such a criterion is given by Rissanen’s Minimum De- 
scription Length Principle (MDL) (see PZj, for a clear introduction). It has 
been largely used in the computer vision literature According to 

that principle, the feature which is the most adapted to the data is the one 
that minimize the length of the code of the data which are described with that 
feature. The model selection criterion is : 



s{x, nii^pi) = - log2 P{x/pi) + sizeofinii) 



( 3 ) 
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where P{x/pi) is the a posteriori probability of x knowing the parameters pi of 
the feature, and sizeofirrii) is the size of the code of the feature which is equal 
to the amount of memory needed to stock the vector of parameters. 

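Note that for a given model, the description length criterion is similar to the
maximum likelihood criterion: the size of the code of the feature is then
constant, so minimizing $s$ amounts to maximizing the probability of the data.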
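The trade-off expressed by the criterion (3) can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the data code
length $-\log_2 P(x \mid p_i)$ is supposed to be already known for each
candidate model, each parameter is supposed to be stored on a fixed number of
bits, and the names `description_length`, `select_model` and the 32-bit default
are hypothetical.

```python
def description_length(neg_log2_likelihood, n_params, bits_per_real=32):
    """Criterion (3): data code length plus the cost of storing the parameters.

    bits_per_real is an assumed storage size for one parameter.
    """
    return neg_log2_likelihood + n_params * bits_per_real


def select_model(candidates, bits_per_real=32):
    """Return the name of the candidate with the shortest description.

    candidates maps a model name to (neg_log2_likelihood, n_params).
    """
    return min(
        candidates,
        key=lambda name: description_length(*candidates[name], bits_per_real),
    )
```

With these definitions, a model with one more parameter is selected only if it
shortens the code of the data by more than the cost of that parameter.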



3.2 Optimal Labelling of a Contour Topological Map 

We seek to segment an elementary contour topological map into a set of curves
following known models, in order to find a continuous representation which
"explains" the discrete map. Those curves can be represented by paths of the
map, each path being associated with a label and a model of a given model set.
A model selection criterion is used to select both the paths and their
associated models, which makes it possible to give a label to each edge of the
elementary contour map. To simplify the problem, we require each edge to have
exactly one label.

Let $G$ be an elementary contour topological map, $C$ the set of all possible
paths of $G$, and $M$ the set of all admissible curve models. We associate with
each vertex of the elementary contour topological map some measures
(coordinates, gradient direction of the initial image, ...). For a path $c_j$,
we can then construct a vector composed of the measures associated with each
vertex of the path. By abuse of notation, we write $s(c_j, m_i, p_{ij})$ for
the model selection criterion built with the measures taken along the path
$c_j$.

The set $F$ of all admissible linear connected features is defined as the
subset of $C \times M$ whose elements respect the following constraint: the
parametric representation of each point with respect to the model and its
estimated parameters must preserve the order of the points along the path.
This constraint is illustrated in figure 2.




Fig. 2. Curve constraint. The drawings of this figure represent two 4-connected paths 
and possible continuous analogs. The left drawing is a correct possible curve, while the 
right drawing does not fulfill the constraint 



Definition 1. A subset $E$ of $F$ is independent if and only if the paths
associated with the elements of $E$ do not share an arc.

Definition 2. An independent subset $E$ of $F$ is of maximal size if and only
if $\forall a \in F \setminus E$, $E \cup \{a\}$ is not independent.

It is straightforward to see that the paths included in such a set $E$ contain
every edge of $G$, so that $E$ is equivalent to a labelling of all the edges of
$G$. Let us define the optimality criterion:

Definition 3 (Optimal labelling). An optimal labelling of $G$ according to a
model set $M$ and a model selection criterion $s$ is the independent subset $E$
of $F$ of maximal size minimizing:

$$\sum_{(c_j, m_i) \in E} s(c_j, m_i, p_{ij})$$

Note that the optimal labelling is not necessarily unique. Finding an optimal
labelling seems to be an NP-complete problem in the general case, if no
hypotheses on $s$, $M$, and $G$ are made. Classical combinatorial optimization
algorithms, such as the A* algorithm or relaxation, can be used to solve this
problem, but their implementation is not efficient, partly because of the
amount of data required.

3.3 Sub-Optimal Labelling Algorithm

As the labelling problem is complex, we do not seek an optimal solution.
In this sub-section, we present an algorithm that can be used with any model
set. This algorithm can be decomposed into three steps.

The first step substantially reduces the amount of computation required in the
following steps. We construct a new topological map from the contour map such
that edges are discrete lines with at most two adjacent faces. Simple classical
algorithms [?] can be used.

In the second step, a local analysis is carried out for each arc of the new map
in order to find the linear connected feature that minimizes the function

$$s(c_j, m_i, p_{ij}) - s(c_j, m_0, p_{0j}) \qquad (4)$$

where $c_j$ is a path that starts with the considered arc, $m_i$ is a model of
$M$, $m_0$ is the model of independent points, and $p_{0j}$ is composed of the
measures associated with each point of $c_j$. The model $m_0$ is the noise
model, and is implicitly used when no model fits the considered path. The
description length $s(c_j, m_0, p_{0j})$ equals the code length of $p_{0j}$.

The local analysis is driven by a depth-first strategy. Each of the "optimal"
linear features is inserted into a priority queue, ordered according to the
previous function.

In the third step, an independent set $E$ of curves of maximal size is
constructed from the extracted features. While the priority queue is not empty,
the first curve $c_j$ of the queue is extracted. If $E \cup \{c_j\}$ is
independent, then $c_j$ is included in the set $E$. If it is not, $c_j$ is split
into $n$ curves $c'_{ij}$ and $n'$ curves $c''_{i'j}$, such that
$E \cup \{c'_{ij},\ 1 \le i \le n\}$ is independent and, $\forall i'$ with
$1 \le i' \le n'$, $E \cup \{c''_{i'j}\}$ is not. The $c'_{ij}$ are then
inserted into the priority queue.
This algorithm is clearly not optimal. The local analysis may result in a
curve that is locally, but not globally, optimal according to the model
selection criterion. Moreover, the complexity of this algorithm is very high,
since each admissible curve is analyzed. To reduce the computational cost, the
depth-first analysis is carried out only until all the possible estimated curve
models fail a confidence test based on the maximum residual distance between
the points and the estimated curve. The threshold value can be chosen in such a
way that it does not change the result of the algorithm.
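The construction of the independent set from the priority queue can be sketched
as follows. This is a simplified illustration and not the implementation used
for the experiments: a path is assumed to be a tuple of arc identifiers, the
function `score` (lower is better) is a hypothetical stand-in for the local
function used to order the queue, and the pieces of a conflicting path that are
still free are simply rescored and pushed back.

```python
import heapq
from itertools import count


def split_free_runs(path, used):
    """Split a path into its maximal runs of arcs that are not yet used."""
    runs, run = [], []
    for arc in path:
        if arc in used:
            if run:
                runs.append(tuple(run))
            run = []
        else:
            run.append(arc)
    if run:
        runs.append(tuple(run))
    return runs


def greedy_labelling(features, score):
    """Greedily build a set of paths that do not share an arc.

    features: iterable of paths (tuples of arc ids).
    score: hypothetical callable, score(path) -> float, lower is better.
    """
    tie = count()  # tie-breaker so that the heap never compares two paths
    heap = [(score(tuple(p)), next(tie), tuple(p)) for p in features]
    heapq.heapify(heap)
    used, selected = set(), []
    while heap:
        _, _, path = heapq.heappop(heap)
        if used.isdisjoint(path):
            # The path is independent of the current set: accept it.
            selected.append(path)
            used.update(path)
        else:
            # Keep only the free pieces and put them back in the queue.
            for run in split_free_runs(path, used):
                heapq.heappush(heap, (score(run), next(tie), run))
    return selected
```

Each piece pushed back is strictly shorter than the path it comes from, so the
loop terminates.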

4 Application 

In this section, we give an example of application with a model set composed
of two models: a line segment model and a noise model (section 4.1). We then
present some results obtained on real images (section 4.2).

4.1 Models 

In this sub-section, we derive the two previous models. Let Cj be a path of 
an elementary contour map. The data vector constructed along the path Cj is 
composed of the coordinates of each vertex of the path, and of the initial image 
gradient direction at each of those points. This direction is supposed to be close 
to the normal vector of an object contour. This measure is used to discriminate 
the real contours from the spurious ones. 

The description length of the noise model is then:

$$s(c_j, \mathrm{noise}, p_{0j}) = n_j\, t_{real} \qquad (5)$$

where $n_j$ is the number of points along the path $c_j$, and $t_{real}$ is the
amount of memory needed to store the chosen representation of a real.

The line segment model is composed of sub-matrices (one sub-matrix per point
$P$) of the following form:

$$f_P = \begin{pmatrix} x_P \cos\theta + y_P \sin\theta - d \\ \theta - \theta_P \end{pmatrix} \qquad (6)$$

where $x_P$ and $y_P$ are the coordinates of $P$, $\theta_P$ is the gradient
direction measured at $P$, $\theta$ the angle between the normal vector of the
line and the x axis, $d$ the distance of the origin to the line, and the two
components $r_{d,P}$ and $r_{\theta,P}$ of $f_P$ are the residuals of $f$.

The parameters of the line can be estimated by non-linear constrained
optimization [?]. In order to reduce the computational cost, we can use a
linear least squares estimator, considering the set of equations
$ax + by = -1$, where $a$ and $b$ are the new parameters of the line, or an
estimate obtained by sampling the data points, using techniques such as the
least median of squares or RANSAC.
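The linear least squares estimator based on the equations $ax + by = -1$ can be
sketched as follows. The sketch solves the 2x2 normal equations in closed form
and converts $(a, b)$ to the normal parameters $(\theta, d)$ of the line
$x\cos\theta + y\sin\theta - d = 0$. The function name `fit_line` is
hypothetical, and the parameterization cannot represent a line through the
origin, so the coordinates are assumed to be shifted beforehand if necessary.

```python
import math


def fit_line(points):
    """Least squares fit of a*x + b*y = -1, returned as (theta, d).

    points: list of (x, y) pairs, assumed not to lie on a line through
    the origin.
    """
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    syy = sum(y * y for _, y in points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("degenerate configuration for a*x + b*y = -1")
    # Cramer's rule on [[sxx, sxy], [sxy, syy]] [a, b]^T = [-sx, -sy]^T
    a = (-sx * syy + sy * sxy) / det
    b = (-sy * sxx + sx * sxy) / det
    norm = math.hypot(a, b)
    # a*x + b*y + 1 = 0 divided by -norm gives x cos(theta) + y sin(theta) - d = 0
    theta = math.atan2(-b, -a)
    d = 1.0 / norm
    return theta, d
```

For instance, the points of the vertical line $x = 2$ give $\theta = 0$ and
$d = 2$.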
Once the optimal parameters have been obtained, the description length of
the path by the line has to be computed. The residuals of the observations
$(x_P, y_P, \theta_P)$ would then have to be calculated, which is costly. We
use instead the residuals of $f$, $r_{d,P}$ and $r_{\theta,P}$. We suppose that
$r_{d,P}$ follows a normal law of standard deviation $\sigma_d$, and that
$r_{\theta,P}$ follows a normal law of standard deviation $\sigma_\theta$.
Moreover, we suppose that the residuals are independent. The description length
of a curve $c_j$ according to a line segment model is given, after a
straightforward development, by:

$$s(c_j, \mathrm{line}, (\theta, d)) = n_j \log_2\left(\frac{2\pi\,\sigma_d\,\sigma_\theta}{\varepsilon_d\,\varepsilon_\theta}\right) + \frac{1}{2\ln 2} \sum_{P \in c_j} \left(\frac{r_{d,P}^2}{\sigma_d^2} + \frac{r_{\theta,P}^2}{\sigma_\theta^2}\right) + (2 + n_j)\, t_{real} \qquad (7)$$

where $\varepsilon_d$ and $\varepsilon_\theta$ are the resolutions of the
digitization of $d$ and $\theta$, $t_{real}$ is the code size in bits of a real,
and $n_j$ is the number of points in the curve $c_j$. The coding length of the
model is $(2 + n_j)\, t_{real}$ because 2 reals are necessary to code the
parameters $\theta$ and $d$, and a real is needed to code the parametric
coordinate of each point along the segment.
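As an illustration, the description length of a path under the line segment
model can be evaluated as sketched below. The formula coded here is our reading
of the development described above (independent Gaussian residuals quantized at
the resolutions $\varepsilon_d$ and $\varepsilon_\theta$, plus $(2 + n_j)$
reals for the model); the function name `line_length` and the default of 32
bits per real are assumptions.

```python
import math


def line_length(residuals, sigma_d, sigma_theta, eps_d, eps_theta, t_real=32):
    """Description length in bits of a path coded with a line segment model.

    residuals: list of (r_d, r_theta) pairs, one per point of the path.
    t_real: assumed code size in bits of a real.
    """
    n = len(residuals)
    # Constant part of the Gaussian code length, for both residuals.
    const = n * math.log2(
        2.0 * math.pi * sigma_d * sigma_theta / (eps_d * eps_theta)
    )
    # Quadratic part, converted from nats to bits.
    quad = sum(
        (rd / sigma_d) ** 2 + (rt / sigma_theta) ** 2 for rd, rt in residuals
    ) / (2.0 * math.log(2.0))
    # Two reals for (theta, d) and one parametric coordinate per point.
    model = (2 + n) * t_real
    return const + quad + model
```

The quadratic term shows why underestimated standard deviations act as a
filter: the smaller $\sigma_d$ and $\sigma_\theta$, the more bits a given
residual costs.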



4.2 Results 

In this section, we discuss the results of the algorithm of sub-section 3.3 on
the image of figure 1 (figure 3).

We first estimate the standard deviations $\sigma_d$ and $\sigma_\theta$. To do
so, test lines can be manually selected, and their estimates can be computed.
Those estimates can be reused for an image sequence. In our example, 8 lines
were used; $\sigma_d$ lay between 0.28 and 0.36 pixel, with a mean of 0.34
pixel, and $\sigma_\theta$ between 0.05 and 0.386 radian, with a mean of 0.14
radian.

In our application, the linear least squares estimator is more robust than the
least median of squares estimator. The MDL criterion used is strong enough to
drop lines with outliers in the linear least squares case, whereas a large
number of random samples does not guarantee the repeatability of the extraction
of locally consistent labellings. It is well known [?] that the MDL criterion
tends to accept outliers. Although not specifically addressed here, this
problem is critical in our algorithm: the early acceptance of a feature with
outliers (in this case, points belonging to another feature) can give a global
labelling result that is not satisfactory. This problem is solved here by
underestimating the standard deviations. Experiments show that a low
$\sigma_\theta$ gives a stronger filtering criterion, while a low $\sigma_d$
gives lines with well localized end points. Figure 3 shows the resulting labels
with a low $\sigma_\theta$. The lower $\sigma_\theta$, the more powerful the
filtering effect. The last figure gives an example of correct labelling.

This method is well adapted to images without contrasted random textures.
It can be applied to other types of images, and has been tested successfully on
aerial images of urban environments. On any image, complex contour shapes are
split into simple linear models from the initial model set. When the model
set contains only the line segment model, the result is a polygonization of the
contours. Although the method presented here is slower than other methods
dedicated to line detection, it can easily be extended to other types of linear
features. Moreover, heuristics can be designed for each model of the model set
in order to reduce the computation time.

Fig. 3. Labelled lines (right column) and noise (left column). Upper images:
$\sigma_d = 0.34$ pixel and $\sigma_\theta = 0.14$ radians. Lower images:
$\sigma_d = 0.12$ pixel and $\sigma_\theta = 0.04$ radians

5 Conclusion 

In this contribution, we have presented the problem of linear feature extraction
as a combinatorial optimization problem. We have also proposed a sub-optimal
algorithm that has been applied to the extraction of line segments in outdoor
photographs. Although it does not perform perfectly, it demonstrates the
usefulness of our approach. Moreover, only two parameters, whose values can be
estimated, are needed. Current and future work concerns the extension of this
approach to face features.
Abstract. We address the problem of object recognition in computer vision. We
propose an invariant representation of the model and scene in the form of an
Attributed Relational Graph (ARG), with a focus on region based measurements
rather than purely on interest points. This approach enhances the stability of
the scene image representation in the presence of noise and significant
scaling. An improved solution is achieved by employing a multiple region
representation at each node of the ARG. The matching of scene and model ARGs is
accomplished using probabilistic relaxation that has been modified to cope with
multiple scene representation. The preliminary results obtained in experiments
with real data are encouraging.

Keywords: Object Recognition, Invariant Representation, Computer Vision 



1 Introduction 

The recognition of objects in cluttered scenes is one of the most challenging problems in 
computer vision. The problem is inherently difficult not only because of the omnipresent 
noise but also due to a host of other factors which are intrinsic to the process of object 
sensing using imaging techniques. These factors include geometric transformation of 
the measurements as a result of changing view point, geometric distortion due to the 
imperfection of the imaging optics, occlusion and clutter, 3D nature of objects and last 
but not least, the lack of object specificity. In this paper we shall not be concerned with 
the last issue which raises the question how generic classes of objects, for example a 
chair, should be represented so that any member of a class can be easily recognised even 
if it has never been seen by the vision system before. This is the subject of research in the
domain of syntactic, structural and functional object modelling [?]. We shall not even
be addressing explicitly the 3D nature of objects. Instead we shall take the view that 
objects can be represented as a union of planar surfaces. This "orange peel" modelling 
is fully appropriate for a large family of objects which are essentially polyhedral and 
provides an adequate approximation in a significant number of practical situations where 
the deviation from planarity of an object face can be absorbed into other geometric 
distortions the system would have to be able to cope with. Thus the focus of the paper is 
on recognition of faces of specific objects, which can be viewed to be dominantly planar, 
subject to the combined effect of noise, viewing transformation changes, geometric 
distortion, occlusion and scene clutter. 

F.J. Feni et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 297-11771 2000. 

© Springer- Verlag Berlin Heidelberg 2000 
In model based object recognition there are two major interrelated problems,
namely that of object representation and the closely related problem of object
matching. A number of representation techniques have been proposed in the
computer vision literature, which can be broadly classified into two
categories: feature based and holistic (appearance based). We shall not be
dismissive of the appearance based techniques, as they clearly possess positive
merits and no doubt can play a complementary role in object recognition.
However, our motivation for focusing on feature based techniques is their
natural propensity to cope better with occlusion and local distortion.

The matching process endeavours to establish a correspondence between the
features of an observed object and a hypothesised model. This invariably
involves the determination of the object pose. The various object recognition
techniques proposed in the literature differ in the way the models are invoked
and verified. At one end of the methodological spectrum are the alignment
methods [?], where the hypothesised interpretation of image data and the
viewing transformation is based on the correspondence of a minimal set of
features. The candidate interpretation and pose are then verified against other
image and model features. As there can be a large number of combinations of
candidate model and pose hypotheses, the method is relatively time consuming.
The other end of the spectrum is occupied by geometric hashing [?] and Hough
transform methods [?], where all the scene features are used jointly to index
into a model database. Once a model is invoked, its pose estimation and
verification can be accomplished by means of model back projection. This
approach is likely to require a smaller number of hypothesis verifications.
However, its success is largely dependent on the ability to extract distinctive
features reliably.

In an earlier work [?] it was argued that an effective object recognition method
should be based on the extraction of relatively simple features as only such features 
can be reliably detected in complex images. The distinctiveness of such features can be 
enhanced by relational measurements. However, these should be of low order to mini- 
mise the combinatorial computational complexities of both the feature extraction and 
model matching, and to maximise the probability of the features being observable. It 
was also argued that the processes of model invocation and pose estimation should be 
combined into a single, unified matching mechanism. As a suitable tool to achieve the 
latter objective the evidence combination method of relaxation labelling was advoca- 
ted. The method uses an attributed relational graph for scene and model representation 
employing only unary and binary relations which are made invariant to any pertinent 
geometric transformation group. 

The actual implementation of the method used as features interest points on occluding 
and surface texture boundaries. The method was shown to work well in experiments 
involving both synthetic 2D and real 3D objects in cluttered backgrounds. However, it 
was found that its ability to recognise objects deteriorates for complex scenes containing 
many objects when the size of each object becomes inevitably small. In such situations 
the extraction of interest points becomes very unreliable and as a result the recognition 
performance degrades. It was noted that in contrast, under scaling, regions remain stable 
over a large range of scales. In this paper we propose the use of region based features 
for attributed relational graph representation of scene and model objects. Apart from 
its desirable stability under scaling, the proposed representation can naturally exploit
powerful cues such as region colour in the recognition process. 

The paper is organised as follows. In the following section we describe the
proposed representation. The relaxation algorithm adopted for the attributed
relational graph matching is described in Section 3. The experiments carried
out and the results obtained are presented in Section 4. Section 5 draws the
paper to a conclusion.



2 Representation 

In this section we address the problem of affine invariant representation of both scene 
and model based on regional features. For this purpose we regard an image of the scene 
or model as a set of regions. Then for each region we provide a basis matrix which allows 
us to transform the region to a normalised space in which the corresponding regions of 
model and scene are identical. Finally we construct an Attributed Relational Graph in
which normalised regions are considered as graph nodes and binary relations between 
region pairs constitute graph links. 

It has been shown in [?] that a matrix $B$ which possesses the following properties
can be used to transform a region to an invariant space:

1. $B$ is a non-singular matrix.

2. Matrices $B$ and $B'$ associated with corresponding regions $R$ and $R'$ respectively are
related as $B' = BT$, where $T$ is a transformation matrix which maps $R$ to $R'$. In
other words, $R' = RT$.

Such a matrix is called a basis matrix.

Using the basis matrix $B$, the barycentric coordinates of an arbitrary point $P$ of region $R$
can be defined as:

$$C_B(P) = P B^{-1} \qquad (1)$$

The coordinate system so defined is transformation group invariant, since an arbitrary point
$P$ of region $R$ and the corresponding point $P'$ of region $R'$ have the same barycentric
coordinates:

$$C_{B'}(P') = C_{BT}(PT) = (PT)(BT)^{-1} = C_B(P) \qquad (2)$$

The barycentric coordinates can be used as unary relations of region $R$. Similarly, a
binary relation matrix $A_{ij}$ associated with a pair of regions $R_i$ and $R_j$ can be defined as
$A_{ij} = B_i B_j^{-1}$. Considering $A'_{ij}$ as the binary relation matrix related to $R'_i$ and $R'_j$
(the regions corresponding to $R_i$ and $R_j$ in the scene), we can readily show that:

$$A'_{ij} = B'_i B'^{-1}_j = (B_i T)(B_j T)^{-1} = B_i T T^{-1} B_j^{-1} = A_{ij} \qquad (3)$$

Thus the binary relation matrices are transformation group invariant as well.
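The two invariance properties can be checked numerically with a short sketch.
The matrices below are arbitrary non-singular examples and not bases computed
from real regions; points are written as homogeneous row vectors $(x, y, 1)$,
and the helper names `matmul`, `inverse`, `barycentric` and `binary_relation`
are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(m)
    aug = [
        [float(v) for v in row] + [float(i == j) for j in range(n)]
        for i, row in enumerate(m)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def barycentric(point, basis):
    """C_B(P) = P B^{-1} for a row vector P."""
    return matmul([point], inverse(basis))[0]


def binary_relation(basis_i, basis_j):
    """A_ij = B_i B_j^{-1}."""
    return matmul(basis_i, inverse(basis_j))


# Arbitrary example values (assumptions chosen only for the check).
B_i = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 1.0]]
B_j = [[1.0, 2.0, 0.0], [0.0, 1.0, 0.0], [2.0, 2.0, 1.0]]
T = [[1.0, 0.5, 0.0], [-0.3, 2.0, 0.0], [7.0, -4.0, 1.0]]  # affine map
P = [3.0, 4.0, 1.0]

coords_model = barycentric(P, B_i)
coords_scene = barycentric(matmul([P], T)[0], matmul(B_i, T))
A_model = binary_relation(B_i, B_j)
A_scene = binary_relation(matmul(B_i, T), matmul(B_j, T))
```

Up to rounding errors, `coords_model` equals `coords_scene` and `A_model`
equals `A_scene`, whatever the non-singular transformation `T`.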

For affine invariance it can be shown that the following matrix has the properties of a basis
matrix, which transforms an arbitrary region $R$ to a normalised region with unit area [?]:

$$B = k_n \begin{pmatrix} 1 & 0 & 0 \\ -u_{1,1}/u_{0,2} & k & 0 \\ 0 & 0 & 1 \end{pmatrix}^{-1} \times \begin{pmatrix} c_x - x_0 & y_0 - c_y & 0 \\ c_y - y_0 & c_x - x_0 & 0 \\ x_0^2 + y_0^2 - y_0 c_y - x_0 c_x & x_0 c_y - y_0 c_x & k_n \end{pmatrix} \qquad (4)$$

where points $C(c_x, c_y)$ and $P_0(x_0, y_0)$ are two reference points of region $R$, $k_n$ is the
distance between the two reference points, $k = k_n / (\text{Area of region})$ and $u_{1,1}$, $u_{0,2}$ are
second order moments of the modified region $R$.

In the proposed method we regard the centroid of a region as one of the required
reference points. The way the second reference point is selected is different for scene and 
model. In the case of the model the highest curvature point on the boundary of the region 
is chosen as the second reference point, while in the scene for each region a number of 
points of high curvature are picked and consequently a number of representations for 
each scene region are provided. The selection of more than one point on the boundary is 
motivated by the fact that an affine transformation may change the ranking and distort 
the position of high curvature points. 

Given a set of regions and their boundaries we construct the associated bases using the 
above reference points. Each region and the corresponding bases constitute a node in 
the image representation graphs. Unary attributes of the graph can be defined as the
barycentric coordinates of some representative points measurable in both scene and model
images. In addition, other auxiliary attributes such as region colour can be defined
for this purpose. As well as the binary geometric relations $A_{ij}$, we could define chromatic
relations by considering the respective colours of the two regions.

Now we represent the model graph as $G = \{\Omega, X, A\}$ where $\Omega = \{\omega_1, \omega_2, \cdots, \omega_M\}$
denotes the set of nodes (normalised regions) and $X = \{x_1, x_2, \cdots, x_M\}$ is a set of
unary measurement vectors, where for each node $\omega_i$ we have a vector of measurements
$x_i$ including colour, second order moments of the normalised region and the barycentric
coordinates of a number of judiciously selected points on the region boundary.
$A = \{A_{ij} \mid (i,j),\ i,j \in \{1, \cdots, M\}\}$ represents a set of binary measurement vectors
associated with the node pairs, so that each measurement vector $A_{ij}$ associated with
a pair of nodes $\omega_i$ and $\omega_j$ contains the binary relation matrix and the size ratio of the
two regions.

Similarly, the graph $G = \{a, X, A\}$ represents the scene image. The only
significant difference in the scene representation is that more than one basis is
provided for each scene region. The multiple representation for each scene node is
defined in terms of a set of unary measurement vectors $x_i^k$, where index $k$ indicates
that the vector is associated with the $k$th representation of the $i$th node. Also, each pair
of regions $a_i$ and $a_j$ and the associated bases $B_i^k$, $B_j^l$ define a binary relation $A_{ij}^{kl}$. The
multiple unary measurement vectors and binary relations constitute the combined
3 Matching

The graph matching problem has been approached in many different ways in the
computer vision literature, and many attempts to reduce its inherent complexity
have been reported. Among the proposed methods, relaxation labelling is
recognised as one of the most effective. The technique was introduced by
Rosenfeld, Hummel and Zucker [?]. The method uses contextual information to
update the probability distribution of the labels of each node of the graph in
an iterative manner. The major criticism of the early relaxation algorithm was
that its foundations and design methodology were very heuristic. However, the
subsequent work of Hummel and Zucker [?] and Kittler and Hancock [?] overcame
these major points of criticism. In particular, [?] provided a theoretical
underpinning of probabilistic relaxation using a Bayesian framework. This work
was further extended by Christmas et al. [?] by incorporating binary relation
measurements into the relaxation process. The introduction of measurements
throughout the relaxation process made the relaxation labelling approach more
efficient. We have adopted the relaxation labelling technique of [?] and [?]
and adapted it to our problem of matching a scene graph with multiple
representations. The problem considered contrasts with previous applications of
relaxation methods, where a unique representation exists at each node of the
scene graph. Similarly to [?] we add a null label to the label set to reduce
the probability of incorrect labelling. The essential difference in our
matching problem is that the product support function derived in [?] is not
applicable, because the scene clutter drives the total support to zero, thus
masking the coherent support even from consistently labelled objects. For this
reason we have adopted the benevolent sum support function to measure the
supporting evidence from the neighbouring objects, as in [?].

We formulate the matching problem as one of assigning a proper label from the label
set $\Omega = \{\omega_0, \cdots, \omega_M\}$ to each object of $a = \{a_1, \cdots, a_N\}$, where label $\omega_0$ is the null
label assigned to the objects for which no other label is appropriate. Let $P(\theta_i^k = \omega_{\theta_i^k})$
denote the probability of label $\omega_{\theta_i^k}$ being the correct interpretation of object $a_i$ using
the $k$th representation, where $k \in \{1, \cdots, L\}$ and $L$ is the number of representations for
object $a_i$. In the iterative process we consider all possible assignments for object $a_i$
and update the related probabilities using their previous values and the supports provided
by other objects. As mentioned before, we combine the iteration rules in [?] and [?] to
derive a new iteration formula defined as:



where function $Q$ quantifies the support that assignment $(\theta_i^k = \omega_\alpha)$ receives at the $n$th
iteration step from the other objects in the scene and $l_j$ is the $l$th representation currently
active at node $a_j$.




(5) 







(6) 
As can be seen from the support function, we involve only one of the available
representations for each neighbouring object at a time in the support evaluation; it is
denoted by index $l_j$. We shall refer to it as the most likely representation. Although at
the beginning of the algorithm the most likely representation for each object is selected
randomly, after updating object label probabilities for all the graph nodes we find the
most likely representation for each object by computing the representation entropy. The
representation with the minimum entropy of labelling assignment identifies the most
likely representation.
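The selection of the most likely representation can be sketched as follows.
The sketch assumes that the entropy in question is the Shannon entropy of the
label probability distribution obtained with each representation; the function
names `label_entropy` and `most_likely_representation` are hypothetical.

```python
import math


def label_entropy(probs):
    """Shannon entropy (in nats) of a label probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_likely_representation(probs_by_representation):
    """Index of the representation with minimum label entropy.

    probs_by_representation[k] is the label distribution of one object
    under its k-th representation.
    """
    return min(
        range(len(probs_by_representation)),
        key=lambda k: label_entropy(probs_by_representation[k]),
    )
```

A representation whose probability mass is concentrated on one label has a low
entropy and is therefore preferred to one whose labelling is still ambiguous.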

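In the first step of the process the probabilities are initialised based on the unary
measurements extracted. Denote by $P^{(0)}(\theta_i^k = \omega_{\theta_i^k})$ the initial label probabilities evaluated
using the unary attributes as:

$$P^{(0)}(\theta_i^k = \omega_{\theta_i^k}) = P(\theta_i^k = \omega_{\theta_i^k} \mid x_i^k) \qquad (7)$$

Applying Bayes' theorem we have:

$$P(\theta_i^k = \omega_{\theta_i^k} \mid x_i^k) = \frac{p(x_i^k \mid \theta_i^k = \omega_{\theta_i^k})\, P(\theta_i^k = \omega_{\theta_i^k})}{\sum_{\omega_\alpha \in \Omega} p(x_i^k \mid \theta_i^k = \omega_\alpha)\, P(\theta_i^k = \omega_\alpha)} \qquad (8)$$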



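Let $\zeta$ be the proportion of scene nodes that will assume the null label. Then the prior
label probabilities will be given as:

$$P(\theta_i^k = \omega_\lambda) = \begin{cases} \zeta & \lambda = 0 \ (\text{null label}) \\ \dfrac{1 - \zeta}{M} & \lambda \neq 0 \end{cases} \qquad (9)$$

where $M$ is the number of labels (model nodes).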
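The initialisation can be sketched as follows. The prior coded here, $\zeta$
for the null label and $(1 - \zeta)/M$ for each of the other labels, is our
reading of the text; the likelihood values are assumed to be supplied by the
caller (including a value for the null label, whose definition is not given
here), and the function names are hypothetical.

```python
def prior(num_model_nodes, zeta):
    """Prior label probabilities; index 0 is the null label."""
    return [zeta] + [(1.0 - zeta) / num_model_nodes] * num_model_nodes


def initial_probabilities(likelihoods, zeta):
    """Bayes normalisation of unary likelihoods with the prior above.

    likelihoods[a] is p(x | theta = omega_a) for a = 0..M, where a = 0 is
    the null label.
    """
    m = len(likelihoods) - 1
    joint = [l * p for l, p in zip(likelihoods, prior(m, zeta))]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("all likelihoods are zero")
    return [j / total for j in joint]
```

With equal likelihoods the initial probabilities reduce to the prior, so that a
large $\zeta$ favours the null label for objects whose unary measurements are
not discriminative.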

Assuming that the errors in the unary measurements are Gaussian and statistically
independent, we can express the distribution function as:

$$p(x_i^k \mid \theta_i^k = \omega_\alpha) = \mathcal{N}_{x_i^k}(x_\alpha, \Sigma_u) \qquad (10)$$

where $\Sigma_u$ is a diagonal covariance matrix for the measurement vector $x_i^k$, which depends
on the noise level in the extracted unary measurements. In the support function $Q$ the term
$p(A_{ij}^{k l_j} \mid \theta_i^k = \omega_\alpha, \theta_j^{l_j} = \omega_\beta)$ plays the role of the compatibility coefficient of other
relaxation methods. In fact it is the density function for the binary measurement $A_{ij}^{k l_j}$
given the matches $\theta_i^k = \omega_\alpha$ and $\theta_j^{l_j} = \omega_\beta$, where index $l_j$ refers to the current most
likely representation for object $a_j$. To define the binary distribution function we centre it
on the model binary measurement $A_{\alpha\beta}$ and assume that deviations from this mean are
modelled by a Gaussian. Thus we have:

$$p(A_{ij}^{k l_j} \mid \theta_i^k = \omega_\alpha, \theta_j^{l_j} = \omega_\beta) = \mathcal{N}_{A_{ij}^{k l_j}}(A_{\alpha\beta}, \Sigma_b) \qquad (11)$$

where $\Sigma_b$ is the covariance matrix of the binary measurement vector $A_{ij}^{k l_j}$.

The iterative process is terminated in one of the following circumstances:

1. In the last iteration, none of the probabilities changes by more than a threshold $\epsilon$.

2. The number of iterations reaches some specified limit.

It seems reasonable that the representation of each object to be considered is the most 
likely representation. With this strategy we will assign to each object the most unambi- 
guous labelling at any point in the iterative updating process. 
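The two termination conditions can be sketched with a generic iteration loop.
The update rule itself is passed as a callable and is not the iteration formula
of this paper: the example update below uses a fixed support for each label,
which is a deliberate simplification, and all names and default values are
hypothetical.

```python
def relax(initial, update, eps=1e-4, max_iter=50):
    """Iterate update() until the probabilities settle or the limit is hit.

    initial: list of label distributions, one per object.
    update: callable mapping the current distributions to the new ones.
    Returns the final distributions and the number of iterations performed.
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    probs = initial
    for iteration in range(1, max_iter + 1):
        new = update(probs)
        change = max(
            abs(n - o)
            for new_row, old_row in zip(new, probs)
            for n, o in zip(new_row, old_row)
        )
        probs = new
        if change <= eps:  # condition 1: no probability moved by more than eps
            break
    return probs, iteration  # condition 2 applies when the loop runs out


def make_update(support):
    """Example update: multiply by a fixed support and renormalise."""
    def update(probs):
        new = []
        for p_row, q_row in zip(probs, support):
            weighted = [p * q for p, q in zip(p_row, q_row)]
            total = sum(weighted)
            new.append([w / total for w in weighted])
        return new
    return update
```

For example, `relax([[0.5, 0.5]], make_update([[2.0, 1.0]]))` stops well before
the iteration limit, with almost all the probability on the first label.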

4 Experimental Results 

In this section, we provide some preliminary results to demonstrate the potential of the
proposed method. We start by demonstrating one of the key advantages of our method
with respect to the previous work [?], which derives from the adopted representation.
As discussed in Section 2, we normalise each region using two reference points chosen
on the region. In our method the region centroid and a high curvature point on the
region boundary are selected as reference points, in contrast with the previous work [?]
in which both reference points were picked from among points of high curvature.




Fig. 1. A sample region at different scales 



To illustrate our motivation for this choice, consider figure 1, where a region and its
scaled instances are presented. Note that for clarity small objects are not
shown in real size. For comparison, the points of high curvature and the region centroids
corresponding to the regions are shown in figure 2. As can be seen, under scaling interest
points may disappear or their relative positions may change, and this will affect the
reference bases computed for each object. This instability will be further aggravated by
noise, imaging transformation and change in orientation. In contrast, the position of the
centroid is considerably more stable. This is why we use the region centroid as one of the
reference points in our representation.

To give an intuitive feeling for the effect of the region normalisation we consider two
corresponding regions, shown in figure 3a, which have been obtained as a result of different
affine transformations. The marked points on the regions are the extracted reference
points for each region. The results of normalisation are shown in figure 3b. As one can
see, there is a negligible difference between the region boundaries, which is due to the
error in the determination of the corresponding reference points.

Fig. 2. Interest points and region centroids extracted for regions in Fig. 1

Fig. 3. Normalised representation of sample region: (a) corresponding regions with their
reference points, (b) result of normalisation

We now proceed to demonstrate the recognition ability of the proposed method in a
realistic scenario involving real images. The object to be recognised, shown in figure 4a,
is a cereal box. A relatively clear, close-up view of the object is used as the model image.
In figure 4b we show a number of regions in the model image used for object
representation. We used the colour segmentation method proposed in [?] to extract image
regions. The matching method is applied to complex test images containing a number of
other objects. Two examples are shown in figures 5 and 6. As can be seen, the imaging
viewpoint is such that the object of interest in these test images is significantly smaller
than the model and the related regions are considerably deformed. Also, the scene
images and the model image are taken in different illumination conditions. Figures 5b
and 6b show the regions in the test images which have been correctly interpreted. They
are presented as black and gray regions. Although there is a relatively large number
of regions in the scene images (75 and 56 regions in figures 5b and 6b respectively),
the method is able to recognise about 80% of the model regions correctly, while all the
irrelevant regions in the test images take the null label.

In our implementation we chose three bases for each scene region to provide a relatively 
reliable representation. Note that the improved robustness resulting from the use of a 
multiple representation for each region was inevitably achieved at the expense of increased 
complexity of the method. As mentioned in the representation section, we use region 
colour as one of the unary measurements associated with each node in the ARG. 
Since the test images are taken in different illumination conditions, we use the YUV 
colour system with emphasis on the chromaticity components U and V, which contain 
pure colour information. The intensity component Y is ignored in the matching stage. 
Interestingly, the iterative process converges fast. After five steps the difference between 
the corresponding label probabilities in two successive steps is negligible (less 
than 0.0001). It should also be mentioned that the matching algorithm is insensitive to 
parameters and Ei, which have been determined experimentally. 

Fig. 4. The model image and a number of constituent regions 



5 Conclusions 



We addressed the problem of object recognition in computer vision. An invariant representation 
of the model and scene in the form of an Attributed Relational Graph (ARG), with a focus 
on region-based measurements rather than purely on interest points, has been proposed. 
This approach enhances the stability of scene image representation in the presence of 
noise and significant scaling. The robustness of the representation is further enhanced 
by employing a multiple region representation at each node of the ARG. 

The matching of scene and model ARGs is accomplished using probabilistic relaxa- 
tion that has been modified to cope with multiple scene representation. The preliminary 
results obtained in experiments with real data are encouraging. 

Fig. 6. Test image 2 and related regions 

Abstract. This paper considers a Hidden Markov Model (HMM) for 
shape boundary generation which can be trained to be consistent with 
human expert performance on such tasks. That is, shapes are defined by 
sequences of “shape states” each of which has a probability distribution 
of expected image features (feature “symbols”). The tracking procedure 
uses a generalization of the Viterbi method by replacing its “best-first” 
search by “beam-search” so allowing the procedure to consider less likely 
features as well in the search for optimal state sequences. Results point 
to the benefits of such systems as an aid for experts in depicting shape 
boundaries as is required, for example, in Cartography. 

Keywords: Hidden Markov Models, symbolic descriptions of bounda- 
ries, predicting human performance, Viterbi Search. 



1 Introduction 

Though generating the boundary or shape of single objects seems quite simple, 
there are still no automated procedures which can reliably perform such tasks. 
On closer inspection of aerial images, for example, it can be seen that the local 
variabilities of color/intensity which experts use to infer features are difficult to 
encode by machines without additional knowledge including characteristics of 
human performance. This paper deals with this latter perspective and explores 
how Hidden Markov Models can be applied to the generation of symbolic de- 
scriptions of low-level image features such as shape boundaries in ways which 
are consistent with specific task demands and image types. 

The proposed model defines structures in terms of sequences of “shape sta- 
tes” and the proposed HMM generates such states through a model based upon: 
(1) defining shape boundaries as sequences of (boundary) “Shape States” (SS) 
that determine the local boundary structures at given positions and associa- 
ted directions for interpolation between such positions and states; (2) image 
feature extraction; (3) image feature registration as a discrete set of feature ty- 
pes: “feature symbols”; (4) learning to bind such symbols with Shape States; 
(5) the use of constrained search over neighboring image regions to find the 
appropriate feature symbols to evidence Shape States which, in turn, generate 
(predict) the shape boundary. In the following sections a brief overview of these 
components is provided. 



2 Feature Extraction and Shape States 

For shape (boundary) encoding, it is necessary to encode feature values which 
represent intensity/color contrast, sidedness, orientation and related properties 
of the boundary. In this project we have used multi-scaled oriented approxima- 
tions to what we term “petal filters” , a variation of the “wedge filters” [Zj where 
each “petal” pair is defined about a center position, x, by: 



$$\{\, G_i(x + u_i, \mathrm{cov}_i),\; G_i(x - u_i, \mathrm{cov}_i) \,\}, \quad i = 1, \ldots, n \qquad (1)$$

as shown in Figure 1. Here, $n$ corresponds to the number of oriented filters, $x$ to 
the center position of the filter, $u_i$ to the offset for the gaussian ($G_i$) center, and 
$\mathrm{cov}_i$ to the modulation covariance defining the weighting of pixel values over the 
filter region. We choose such pairs of offset gaussians to represent the sensitivity 
to orientation information as a non-monotonic function of distance from the 
center, with the nearer distances and larger distances having less sensitivity due 
to resolution and integration window limits respectively. Specific configurations 
of such filters form the feature “symbol” codes (feature symbols: FS) which are 
indexed via the Shape States in the HMM (see Figure 2). 
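
As an illustration of Eq. (1), the following is a minimal sketch of one way such offset gaussian pairs could be built and applied to a grey-level image. It is not the implementation used in this work: the function names, the unnormalised gaussian, the equally spaced orientations over a half circle and the use of a weighted pixel mean as the "response" are all assumptions made for the example.

```python
import numpy as np

def gaussian_2d(shape, center, cov):
    """Unnormalised 2-D gaussian evaluated on a pixel grid of the given shape."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d = np.stack([xs - center[0], ys - center[1]], axis=-1).astype(float)
    inv = np.linalg.inv(cov)
    return np.exp(-0.5 * np.einsum("...i,ij,...j->...", d, inv, d))

def petal_pair(shape, x, offset, theta, cov):
    """Pair {G(x + u, cov), G(x - u, cov)} with u = offset * (cos theta, sin theta)."""
    u = offset * np.array([np.cos(theta), np.sin(theta)])
    return gaussian_2d(shape, x + u, cov), gaussian_2d(shape, x - u, cov)

def petal_responses(image, x, offset, n, cov):
    """Responses of n oriented petal pairs centred at x.

    Assumption: each petal response is the gaussian-weighted mean of the
    pixel values under that petal; orientations are spaced over [0, pi).
    """
    x = np.asarray(x, dtype=float)
    out = []
    for i in range(n):
        theta = np.pi * i / n
        g_plus, g_minus = petal_pair(image.shape, x, offset, theta, cov)
        out.append(((image * g_plus).sum() / g_plus.sum(),
                    (image * g_minus).sum() / g_minus.sum()))
    return np.array(out)
```

For a vertical step edge, the pair oriented across the edge gives strongly different responses for its two petals, while the pair oriented along the edge gives equal responses; this contrast pattern over orientations is the kind of vector that could then be quantised into feature symbols.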




Fig. 1. Petal Filters are defined by linear combinations of symmetrically offset gaus- 
sians (even) in a set of orientations tuned by sets of covariance functions. A variety 
of image properties can be directly encoded by this filter such as corners, orientation 
and, in general, color contrast. Specific types of petal spatio-chromatic configurations 
(“flowers”) form the fundamental feature symbol encoders in the current system. 



Here 8 basic shape states (SS) were used, corresponding to inside and outside 
corners over vertical (Figure 2) and oblique orientations. These SSs are used to 
define our domain shape models as composed of right-angle polygons - a form 
of particular relevance to Geographic Information System (GIS) data formats. 
For this reason the proposed model is termed the “What-HMM” (W-HMM), as 
positional information is not explicitly encoded in the model^1. 

Fig. 2. 8 shape states (SS) used to symbolically describe shape boundaries in terms 
of: (1) SS sequences - defined by their individual and transition probabilities; (2) the 
probabilities of the feature types (see Figure 1) given each SS - all defined in the 
W-HMM 



3 Generating Symbolic Descriptions 

In recent years most feature tracking models have been developed to model environmental 
exploration and feature detection in the area of robot navigation, and as 
models for visual “attention” in humans, with particular interest in integrating 
peripheral and foveal vision jSj. Combinations of multi-scaled filtering, Kalman filtering 
(predictive mode), Hidden Markov Models and, in general, adaptive control 
models are typical of what has been used |bl4frj. However, this type of approach 
has not been used for basic tasks like shape boundary recognition and production. 

More formally, the boundary tracking W-HMM is defined as follows^2. 
Let: $T$ = length of the sequence of observed features (symbols); $N$ = number 
of Shape States (SS); $M$ = number of feature types - feature symbols (FS); 
$\Omega_X = \{q_1, \ldots, q_N\}$: the underlying SS sequence defining the shape; 
$\Omega_O = \{v_1, \ldots, v_M\}$: the observed FS sequence; $X_t$: random variable 
denoting the SS at time $t$; $O_t$: random variable denoting the observed FS at time $t$; 
$\sigma = o_1, \ldots, o_T$: the sequence of observed feature symbols. The HMM 
probabilities are then defined by: 

$A = \{a_{ij}\}$ such that $a_{ij} = \Pr(X_{t+1} = q_j \mid X_t = q_i)$: the SS transition probabilities. 

$B = \{b_i\}$ such that $b_i(k) = \Pr(O_t = v_k \mid X_t = q_i)$: the state-conditional FS probabilities. 

$\pi = \{\pi_i\}$ such that $\pi_i = \Pr(X_0 = q_i)$: defines the initial or prior SS probabilities. 

The W-HMM for generating the shape boundary is defined by the five-tuple 
$(\Omega_X, \Omega_O, A, B, \pi)$. We let $\lambda = \{A, B, \pi\}$ denote the parameters 
for a given W-HMM with fixed $\Omega_X$ and $\Omega_O$. 

^1 Although defining shapes only in terms of expected SS transitions implies an equivalence 
class of shapes invariant to the distances between SS’s, such SS’s are evidenced 
from feature types which, from the training data, constrain the types of expected 
positional ranges for such shapes. 

^2 The following formulation is adapted from 
http://ftp.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html 
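
To make the roles of the initial, transition and state-conditional probabilities concrete, the sketch below stores them as arrays and evaluates the joint probability of a given SS sequence and FS sequence directly from the definitions above. The dictionary layout and the function names `make_whmm` and `joint_log_prob` are hypothetical and introduced only for this example; states and symbols are assumed to be coded as integer indices.

```python
import numpy as np

def make_whmm(pi, A, B):
    """Bundle the parameters (A, B, pi) after checking they are valid distributions.

    pi: (N,) prior SS probabilities; A: (N, N) SS transition probabilities;
    B: (N, M) state-conditional FS probabilities.
    """
    pi, A, B = (np.asarray(m, dtype=float) for m in (pi, A, B))
    n = pi.shape[0]
    if A.shape != (n, n) or B.ndim != 2 or B.shape[0] != n:
        raise ValueError("expected shapes pi:(N,), A:(N,N), B:(N,M)")
    for name, m in (("pi", pi[None, :]), ("A", A), ("B", B)):
        if (m < 0).any() or not np.allclose(m.sum(axis=1), 1.0):
            raise ValueError(f"{name}: rows must be non-negative and sum to 1")
    return {"pi": pi, "A": A, "B": B}

def joint_log_prob(model, states, symbols):
    """log Pr(states, symbols | model) for index-coded SS and FS sequences."""
    pi, A, B = model["pi"], model["A"], model["B"]
    lp = np.log(pi[states[0]]) + np.log(B[states[0], symbols[0]])
    for t in range(1, len(states)):
        lp += np.log(A[states[t - 1], states[t]]) + np.log(B[states[t], symbols[t]])
    return float(lp)
```

Working in log space avoids the very small products that arise for long sequences.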

This model is motivated by (but essentially different from) earlier work of Rimey 
and Brown PI who proposed WHAT and WHERE systems for the control of at- 
tention. In their “Augmented HMM” system (AHMM) feedback was introduced 
to allow a WHERE-HMM to re-initialize new positions from the detection of, 
for example, features in the peripheral field of view. The states of their HMM 
corresponded to position movement types and the symbols to actual movements 
and their associated probabilities. The W-HMM differs from this in a number of 
ways. One, the underlying states correspond to Shape States (for example, the 8 
states defined in Figure 2) and the W-HMM encodes the relationships between 
Shape States (SS) and observed filter response types. This type of HMM has 
interesting properties which, by definition, do not involve the explicit encoding 
of positional information but, rather, what features to look for within a specified 
spatial range (“scale”) extracted from, for example, movement distance statistics 
used during training or by constrained neighborhood search. It allows for a more 
general definition of “shape” in so far as it permits the occurrences of symbols 
in more positions than those delimited by HMMs which explicitly encode 
positional information. However, this W-HMM (“What-HMM”) does require search 
strategies which, themselves, could be incorporated into the W-HMM or 
explicitly encoded by an additional HMM - an extension not examined here. 



3.1 Generating HMM Inputs and Initial Estimates 

A Fuzzy version of the K-Means algorithm}^ has been used to determine the 
predominant types of features (the petal filter outputs are defined by a vector 
of “petal” values corresponding to the similarity of spatial and chromatic values 
at each petal orientation - see Figure 1a). 

Initial estimates for each object’s HMM model ($\lambda$) probabilities are determined 
from human performance statistics during training involving: (1) the observed 
SS relative frequencies; (2) the observed relative frequencies of shape state 
(SS) transitions between consecutive states; (3) the state conditional FS relative 
frequencies in the observed training data. The Baum-Welch algorithm was then 
used to update the model estimates from the observed sequence. This is a form 
of Expectation-Maximization where the current model parameters and the observed 
sequence are combined to determine new weighted parameter estimates. 
Following this, a new version of the Viterbi procedure PI - “Viterbi Search” - is 
used to determine the degree to which each HMM can predict SS sequences from 
observed flower (FS) sequences. 

(a) (b) 

Fig. 3. (a) Shows input shape. (b) Shows extracted feature cluster centroids (Feature 
States: FS) associated with the corner with petal filter outputs nearest to the cluster 
(FS). Note how these clusters reflect the major types of FS characteristics of the shape 
boundary. (c) Resultant shape tracking via the W-HMM algorithm. 
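
The relative-frequency initialisation described in items (1) to (3) can be sketched as simple counting over labeled training sequences. This is only an illustration under stated assumptions: the function name `initial_estimates` is hypothetical, states and symbols are index-coded, and the small constant `eps` added to every count (so that unseen events do not receive zero probability) is an addition of this sketch, not something described in the text. The subsequent Baum-Welch update is not shown.

```python
import numpy as np

def initial_estimates(state_seqs, symbol_seqs, n_states, n_symbols, eps=1e-6):
    """Initial (pi, A, B) from relative frequencies in labeled sequences.

    pi: relative frequency of each SS over all training positions (item 1)
    A:  relative frequency of SS transitions between consecutive states (item 2)
    B:  relative frequency of each FS given the SS (item 3)
    """
    pi = np.full(n_states, eps)
    A = np.full((n_states, n_states), eps)
    B = np.full((n_states, n_symbols), eps)
    for states, symbols in zip(state_seqs, symbol_seqs):
        for t, (s, o) in enumerate(zip(states, symbols)):
            pi[s] += 1
            B[s, o] += 1
            if t + 1 < len(states):
                A[s, states[t + 1]] += 1
    # Normalise counts into probability distributions.
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```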



3.2 Viterbi Search 

The standard Viterbi algorithm is a best-first search method for finding the most 
likely state sequence which matches the observed symbol sequence P]. The two 
basic problems with the method are that, being best-first, it is not an optimal search, 
allowing neither for back-tracking nor for a queue of possible combinations of 
states and symbols (SS and FS) beyond the most likely pair. The latter limitation 
is particularly relevant in sequence production where the most likely symbol may 
not be observed - but, for example, the second most likely is - so allowing the 
W-HMM to continue propagating a predicted state sequence, given the existence 
of specific feature types in the predicted range of positions in the image. 

A more general “Viterbi Search” algorithm has been used to overcome these 
limitations. Here, at each time, a queue is constructed from the list of possible 
SSs and FSs and ordered by the products of SS and FS probabilities. This queue 
is “popped” until an FS is detected so selecting the combination of SS and FS 
at that time. 

Accordingly, for the W-HMM at run time, given that the system or expert 
has selected a feature symbol (FS) at a given initial position ($t_0$), it must 
search for the predicted new state(s) within the search window, which is defined 
by the pixel corridor, in this case a +/- 5 pixel range, about the line 
formed between the current and predicted SS directions. The search direction 
is determined from the orientations of the current SS(t) and predicted SS(t+1). 
In all, then, the search is initiated along such paths and candidate positions 
are selected as a function of the most likely feature symbol (top of queue). 
At a given SS(t) this process searches the queue until the most likely feature 
(p(SS(t+1)/SS(t))*p(FS(t+1)/SS(t+1))) can be instantiated through the occurrence 
of a predicted feature at the candidate position. 
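
A single step of the queue-based search described above might be sketched as follows. The candidates (next SS, FS) are ordered by p(SS(t+1)/SS(t))*p(FS(t+1)/SS(t+1)) and popped until one is confirmed in the image. The callback `detect` is a hypothetical stand-in for the corridor search over candidate positions, which is not reproduced here; the function name and return convention are likewise assumptions of this sketch.

```python
import heapq
import numpy as np

def viterbi_search_step(A, B, current_state, detect):
    """Pop (SS, FS) candidates in order of decreasing probability product
    until detect(state, symbol) reports that the feature is present.

    Returns (state, symbol, probability), or None if the queue is exhausted.
    """
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    n, m = B.shape
    # heapq is a min-heap, so probabilities are stored negated.
    queue = [(-float(A[current_state, s] * B[s, k]), s, k)
             for s in range(n) for k in range(m)
             if A[current_state, s] * B[s, k] > 0]
    heapq.heapify(queue)
    while queue:
        neg_p, s, k = heapq.heappop(queue)
        if detect(s, k):
            return s, k, -neg_p
    return None
```

If the most likely symbol is not found in the image, the next entries of the queue are tried, which is what allows the tracker to continue on the second most likely combination.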



3.3 Assessing HMM Performance 

The Viterbi algorithm defines the most likely state sequence in terms of the final 
(joint) maximum probability of the state sequences given the observed symbol 
sequence. This measure is neither optimal (as the Viterbi algorithm only cor- 
responds to best-first search and not optimal search) nor sensitive enough (the 
probability value is typically very small and determined from the products of a 
large number of probabilities) to capture the degree to which the derived state 
sequence is likely to generate the specific observation sequence. For these rea- 
sons we have developed an additional method for assessing the result of the 
Viterbi search and, in turn, the degree to which the Baum-Welch procedure produces 
a model representative of the observed symbol sequences. This method is 
based upon computing the Hamming distance between observed and predicted 
observation sequences using a Monte Carlo method. That is, for a given model 
we generate sequences by randomly selecting states, state transitions and observations 
according to the model probabilities. This is computed a number of 
times to result in a mean and standard deviation of distances between observed 
($O(t)$) and predicted ($P(t)$) observation sequences of length $T$. We have used a 
normalized Hamming distance defined by: 

$$d(P,O) = \frac{1}{T} \sum_{t=1}^{T} \phi(P(t), O(t)) \qquad (2)$$

where 

$$\phi(P(t), O(t)) = \begin{cases} 0 & \text{if } P(t) = O(t) \\ 1 & \text{otherwise} \end{cases} \qquad (3)$$

This results in a direct measure of the likelihood that the particular observed sequence 
would match what can be predicted from the model without any optimal 
search for the “best state sequence” and acts as a baseline to compare with the 
Viterbi solution. The measure can also be interpreted as a simple Edit Distance 
in so far as it indexes the number of edits required to transform the predicted 
into the observed symbol sequence. 
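
The baseline measure can be sketched as below: symbol sequences are sampled from the model alone and compared with the observed sequence using the normalized Hamming distance. The function names, the index coding of states and symbols, and the use of a NumPy random generator passed in by the caller are assumptions of this sketch.

```python
import numpy as np

def normalized_hamming(predicted, observed):
    """Fraction of positions at which two equal-length sequences differ."""
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    if predicted.shape != observed.shape:
        raise ValueError("sequences must have the same length")
    return float(np.mean(predicted != observed))

def sample_symbols(pi, A, B, T, rng):
    """Sample a symbol sequence of length T from the model (pi, A, B)."""
    pi, A, B = (np.asarray(m, dtype=float) for m in (pi, A, B))
    s = rng.choice(len(pi), p=pi)
    out = []
    for _ in range(T):
        out.append(rng.choice(B.shape[1], p=B[s]))
        s = rng.choice(len(pi), p=A[s])
    return np.array(out)

def monte_carlo_distance(pi, A, B, observed, runs, rng):
    """Mean and standard deviation of d(P, O) over `runs` sampled sequences."""
    d = [normalized_hamming(sample_symbols(pi, A, B, len(observed), rng), observed)
         for _ in range(runs)]
    return float(np.mean(d)), float(np.std(d))
```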

The Viterbi procedure results in an estimate of the best state sequence. This 
state sequence can then be used to generate observation sequences using Monte- 
Carlo methods to sample the symbols according to their conditional probability 
densities - resulting in the “constrained” MonteCarlo measure of distance bet- 
ween predicted and observed symbol sequence. Comparing these values indicates 
the “gain” in using Viterbi. In the following examples we have used this latter 
procedure and measure to select and fit the model to observations. 
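
The “constrained” variant differs from the baseline only in that the state sequence is fixed (for example, to the one returned by the Viterbi search) and only the symbols are sampled. A minimal sketch, with a hypothetical function name and index-coded states and symbols, is:

```python
import numpy as np

def constrained_mc_distance(B, state_seq, observed, runs, rng):
    """Mean and std of the normalized Hamming distance when symbols are
    sampled from B[s] for each state s of a fixed state sequence."""
    B = np.asarray(B, dtype=float)
    observed = np.asarray(observed)
    if len(state_seq) != len(observed):
        raise ValueError("state and observation sequences must have equal length")
    d = []
    for _ in range(runs):
        predicted = np.array([rng.choice(B.shape[1], p=B[s]) for s in state_seq])
        d.append(np.mean(predicted != observed))
    return float(np.mean(d)), float(np.std(d))
```

The difference between the baseline mean distance and this constrained mean distance gives one way to quantify the “gain” referred to above.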

4 Illustrations and Experimental Results 

Our current experiment involved the detection, recognition and tracking of shape 
boundaries in three different types of images - shapes whose boundaries are 
defined by specific color contrasts, shapes embedded in additive gaussian noise 
and remotely sensed groups of houses. In each case the task is restricted to 
the detection of ordered sequences of right-angle corners from the petal filter 
outputs. 

We first illustrate how the W-HMM functions with recognition of shape cor- 
ners defined by specific combinations of colours. Figure 3(a) shows an initial 
training image where the system is given examples of corner types (labeled cor- 
ners). Figure 3(b) shows the six resultant cluster centroids and the corners which 
are closest to these values in feature space - mapped as oriented segments of a 
circle. Figure 3(c) shows how the Viterbi search procedure finds the shape bo- 
undary embedded in a quite different montage of color squares although difficult 
to detect by the human eye. This merely illustrates the core concept behind 
the W-HMM: that shape is encoded by the dependencies between specific types 
of shape states as evidenced from the detection of sequences of specific feature 
types. 

In the second experiment we focus more on the degree to which the feature 
extraction and Viterbi search methods are robust enough to activate the correct 
state sequences even when the shapes are embedded in a significant amount of 
noise. The W-HMM was trained on 7 different patterns (Figure 4(Top Left)) and 
tested on unseen patterns with various degrees of zero mean additive gaussian 
noise (Figure 4). The model estimation procedures discussed in Sections 2 and 
3 were used and, in this case, they were able to perfectly generate the training 
sequences ($d(P,O) = 0$). Results with unseen shapes embedded in the most 
extreme noise case ($\sigma = 31.8$ for each color using a 24-bit color image format) 
are shown in Figure 4. In this case the Hamming distance between 
observed and predicted was $d(P,O) = 0.08$, that is, 8% of the SSs were incorrectly 
labeled. 

Fig. 4. Shows performance on unseen patterns using additive gaussian noise. Panel 
label: 92% correct corner tracking, noise $\sigma = 31.8$ (256 grey values; av: 128). 



Another example of the W-HMM involved remotely sensed images of buildings. 
In this case we used six different images and building types - all being 
rectangular polygons - as illustrated in Figure 5. Again, the same model estimation 
procedures were used on the training data. Testing on these training data 
resulted in $d(P,O) = 0$. On unseen buildings from three new images 57 out of a 
total of 68 corners were correctly tracked (Figure 5). 

These results point to the general feasibility of the W-HMM as an aid to 
experts when tracking shape boundaries is required since in all these experiments 
the expected number of edits on predicted SSs was less than 10%. All errors 
occurred with features which could not activate the most likely combination of 
symbol type and state transition and with Viterbi queues with more uniform 
probability densities. This is to be expected and it also follows that the more 
objective ways to define a given HMM lie in the mutual information of the A 
and B matrices and the “second-order” or self-correcting property. 

Fig. 5. Shows examples of tracking performance on remotely sensed images. Notice 
(see, for example, (f)) how errors occur due to the Viterbi search procedure not detecting 
the appropriate features in the queue and so skipping to the correct state on 
another building. 

5 Conclusions 

In this paper we have investigated some new extensions and applications of 
HMMs to shape boundary generation. In this case “shape” has been defined 
as a sequence of SSs which experts use to depict critical properties of objects. 
Results are encouraging, though they require extensions to the normal application 
of the Viterbi method. The specific HMM investigated here, the What-HMM, 
offers a somewhat different definition of sets of equivalent “shapes”: those whose 
sequences of p(SS)p(FS/SS) products are identical. In the cases studied here, 
this produces classes of equivalent right-angle polyhedral objects with identical 
local shape state transitions given specific types of observed features. 

Abstract. We consider a parallel, rule-based approach for learning and 
recognition of patterns and objects in scenes. Classification rules for pattern 
fragments are learned with objects presented in isolation and are 
based on unary features of pattern parts and binary features of part 
relations. These rules are then applied to scenes composed of multiple 
objects. We present an approach that solves, at the same time, evidence 
combination and consistency analysis of multiple rule instantiations. Fi- 
nally, we introduce an extension of our approach to the learning of dy- 
namic patterns. 



1 Introduction 

Over the last decades, research in computer vision has concentrated on reco- 
gnizing simple, isolated objects in controlled situations, and consequently these 
systems often fail in complex, natural settings with many objects. More recently, 
researchers have realized that, in order to overcome these limitations, systems 
have to be enhanced with visual learning capabilities. Many of the learning 
techniques are investigated within symbolic, rule-based systems for recognizing 
specific and generic objects, and for recognizing events and complex scenes ^ 
Such systems are suitable for incremental generation, modular organization and 
efficient application of recognition knowledge. 

One successful approach to the learning of object recognition involves training 
a system with isolated objects in an interactive or supervised learning paradigm. 
Recognition rules are pre-compiled in the form of hashing schemes 0, interpre- 
tation tables or decision trees PE! Most of these schemes rely on attribute 
hashing of single image regions or object parts, and use relational information 
only to a limited degree for hypothesis generation and model indexing. A gene- 
ralized approach based on the use of (unary) attributes of parts and (binary) 
attributes of part relations is presented below [mni. We show that relational pat- 
tern information can be generated adaptively and used efficiently for hypothesis 
generation. 

The incorporation of relational pattern information into pre-compiled rules 
has important implications for rule application in complex scenes. It raises the 
question of how evidence from different rule instantiations should be combined, 
and, more importantly, how consistency between different rule instantiations 
should be assessed. Consistency analysis and label updating can be easily done 
using the simple compatibility functions of classical relaxation labeling jl ,3j, but 
become non-trivial with complex classification rules. This is especially true when 
classification rules are applied to scenes composed of multiple objects where rules 
learned with single objects may be instantiated by pattern fragments “belonging” 
to different objects. To avoid misclassifications, parts belonging to the same 
object should be identified, and it has been traditionally assumed that this clique 
problem has already been solved (e.g. using perceptual grouping |S|). In contrast, 
we propose below an approach where classification of pattern fragments, evidence 
combination and the clique problem are solved at the same time. 

The rule-based approach presented here shares many similarities with ap- 
proaches based on inductive logic programming. However, the parallelism of our 
approach, both in rule learning and in rule application, is the major characteristic 
that sets our system apart from systems such as FOIL HU or GOLEM. 
In rule learning, our system develops trees of decision trees, and hence belongs 
to the class of parallel covering algorithms. In rule application, our system evaluates, 
again in parallel, all rule instantiations and thus is able to evaluate evidence 
combination, evidence consistency and the clique problem at the same time. 
It is this parallelism, we argue, that makes our approach feasible for learning 
complex visual data. A second major advantage of our approach is that it can 
be extended to fuzzy classifiers in a straightforward way, and experiments show 
that this can be done very effectively and efficiently 0. 

In the following sections, we first present our approach to the generation 
and compilation of recognition rules, and then we discuss application of these 
recognition rules in scenes composed of multiple objects. 

2 Learning of Spatial Patterns 

We present an approach to pattern learning, termed Conditional Rule Generation 
(CRG, P) which is based on the following idea. Classification rules for patterns 
or pattern fragments are generated that include structural pattern information to 
the extent that is required for classifying correctly a set of training patterns. CRG 
analyzes unary and binary features of connected pattern components and creates 
a tree of hierarchically organized rules for classifying new patterns. Generation 
of a rule tree proceeds in the following manner (see Fig. 1): 

First, the unary features of all parts of all patterns are collected into a unary 
feature space $U$ in which each point represents a single pattern part. The feature 
space $U$ is partitioned into a number of clusters $U_i$. Some of these clusters may 
be unique with respect to class membership and provide a classification rule: If 
a pattern contains a part $p_r$ whose unary features $u(p_r)$ satisfy the bounds of a 
unique cluster $U_i$ then the pattern can be assigned a unique classification. The 
non-unique clusters contain parts from multiple pattern classes and have to be 
analyzed further. For every part of a non-unique cluster we collect the binary 
features of this part with all other parts in the pattern to form a (conditional) 
binary feature space $UB_i$. The binary feature space is clustered into a number of 
clusters $UB_{ij}$. Again, some clusters may be unique and provide a classification 
rule: If a pattern contains a part $p_r$ whose unary features satisfy the bounds of 
cluster $U_i$, and there is another part $p_s$, such that the binary features $b(p_r, p_s)$ of 
the pair $(p_r, p_s)$ satisfy the bounds of a unique cluster $UB_{ij}$ then the pattern can 
be assigned a unique classification. For non-unique clusters, the unary features 
of the second part $p_s$ are used to construct another unary feature space $UBU_{ij}$ 
that is again clustered to produce clusters $UBU_{ijk}$. This expansion of the cluster 
tree continues until all classification rules are resolved or maximum rule length 
has been reached. 

Fig. 1. Cluster tree generated by the CRG method. Grey clusters are resolved (i.e. 
contain elements of a single pattern class). Unresolved clusters (e.g. $U_1$ and $U_2$) are 
expanded to binary feature spaces (e.g. $UB_1$ and $UB_2$), from where clustering and 
expansion continues until either all rules are resolved or the predetermined maximum 
rule length is reached 

If there remain unresolved rules at the end of the expansion procedure (which 
is normally the case), the generated rules are split into more discriminating 
rules using an entropy-based splitting procedure where the elements of a cluster 
are split along a feature dimension such that the normalized partition entropy 
$H_P(T) = (n_1 H(P_1) + n_2 H(P_2))/(n_1 + n_2)$ is minimized, where $H$ is entropy. Rule 
splitting continues until all classification rules are unique or some termination 
criterion has been reached. This results in a tree of conditional feature spaces 
(as shown in Fig. 1), and within each feature space, rules for cluster membership 
are developed in the form of a decision tree. Hence, CRG generates a tree of 
decision trees. 
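
The entropy-based splitting step can be illustrated with the following sketch, which searches all axis-parallel thresholds of a cluster for the one minimising the normalized partition entropy. The function names, the use of base-2 logarithms, and the choice of candidate thresholds (midpoints between consecutive distinct feature values) are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of the class labels in one partition."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(features, labels):
    """Return (H_P, dimension, threshold) minimising
    H_P = (n1 * H(P1) + n2 * H(P2)) / (n1 + n2)
    over axis-parallel splits; None if no split is possible."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    best = None
    for dim in range(features.shape[1]):
        values = np.unique(features[:, dim])
        for lo, hi in zip(values[:-1], values[1:]):
            thr = 0.5 * (lo + hi)
            left = features[:, dim] <= thr
            n1, n2 = left.sum(), (~left).sum()
            h = (n1 * entropy(labels[left]) + n2 * entropy(labels[~left])) / (n1 + n2)
            if best is None or h < best[0]:
                best = (float(h), dim, float(thr))
    return best
```

A partition entropy of zero means that both resulting clusters are unique with respect to class membership, so the corresponding rules are resolved.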

A completely resolved rule tree provides a set of rules for classification of 
patterns. Every rule in the classification tree corresponds to a sequence 
$U_i - B_{ij} - U_j - B_{jk} - \ldots$ of unary and binary features associated with a chain of 
pattern parts and their relations. A pattern fragment $p_1 - p_2 - \ldots - p_n$ can 
instantiate a classification rule of length $m$ completely (if $n \geq m$) or partially 
(if $n < m$). In the former case, the pattern fragment is classified uniquely; 
in the latter case, classification uncertainty is reduced via the empirical class 
frequencies associated with nodes of the cluster tree. 

It is important to note that the CRG algorithm is more general than classical 
decision trees, given that it develops descriptions in the form of Horn clauses 
$C \leftarrow U_1(X), B_1(X, Y), U_2(Y), B_2(Y, Z), \ldots$ involving unary and relational attributes. 
At the same time, it is also more general than inductive logic programming 
approaches such as FOIL CH, given that the literals of the Horn clauses refer 
to bounded regions of continuous unary and binary feature spaces. Finally, it 
should be pointed out that CRG lends itself fairly naturally to extensions to 
fuzzy classifiers, and it has been shown that this can be done fairly effectively 
and efficiently 0. 

3 Recognition of Spatial Patterns 

CRG generates classification rules for (small) pattern fragments in the form of 
symbolic, possibly fuzzy Horn clauses. When the classification rules are applied 
to some new pattern one obtains one or more (classification) evidence vectors for 
each pattern fragment, and the evidence vectors have to be combined into a single 
evidence vector for the whole pattern. The combination rules can be learned H2|, 
they can be knowledge-guided Pj , or they can be based on general compatibility 
heuristics. In the latter approach, sets of instantiated classification rules are 
analyzed with respect to their compatibilities and rule instantiations that lead 
to incompatible interpretations are removed. This is particularly important in 
scenes composed of multiple patterns where it is unclear whether a chain 
$p_i - p_j - \ldots - p_n$ of pattern parts belongs to the same pattern or whether it is “crossing 
the boundary” between different patterns. Our compatibility analysis makes only 
weak and general assumptions about the structure of scene and objects, and is 
based on the analysis of the relationships within and between instantiated rules 
0. The learning and test situation is illustrated in Fig. 2, which shows some objects 
in isolation, and a scene composed of multiple objects. 

Initial Rule Evaluation. The first rule application stage involves direct
activation of the rules in a parallel, iterative deepening method. Starting from each
scene part, all possible chains of parts are generated and classified using the CRG
rules. The evidence vectors of all rules instantiated by a chain
$S = \langle p_1 p_2 \ldots p_n \rangle$ are averaged to obtain the evidence vector $E(S)$ of the
chain $S$, and the set $S_p$ of all chains that start at $p$ is used to obtain an
initial evidence vector for part $p$:

$$E(p) = \frac{1}{\#(S_p)} \sum_{S \in S_p} E(S) \qquad (1)$$

Fig. 2. The first two rows show several views of objects that are used in the learning
phase. The third row shows a scene composed of several objects, with the input image
on the left, the segmentation result in the middle, and the classification result on the
right (Adapted from 0)



where $\#(S)$ denotes the cardinality of the set $S$. As discussed before, evidence
combination based on (1) does not take into account the fact that some rule
instantiations may be incorrect and incompatible with the rest. To the extent that
such incompatible rule instantiations can be detected, the part classification (1)
can be improved. Compatibility analysis involves an analysis of compatibilities
between and within chains of pattern parts.



Inter-chain Analysis. The inter-chain compatibility analysis is based on the
following general idea: the less compatible the evidence vector of a chain $S_i$
is with the evidence vectors of all chains that $S_i$ touches, the more likely it is
that $S_i$ crosses an object boundary. In this case, $S_i$ is given a low weight in
the computation of (1). More formally, let $S_i = \langle p_{i1} p_{i2} \ldots p_{in_i} \rangle$ and
$S_j = \langle p_{j1} p_{j2} \ldots p_{jn_j} \rangle$ be touching chains, and let $T_{ij}$ be the set of
common parts, i.e. $T_{ij} = \{p \mid \exists k\; p = p_{ik} \text{ and } \exists l\; p = p_{jl}\}$ with
$\#(T_{ij}) > 0$. The compatibility of $S_i$ and $S_j$, $C(S_i, S_j)$, is defined as

$$C(S_i, S_j) = \frac{1}{\#(T_{ij})} \sum_{p \in T_{ij}} \frac{\#(M(p \mid S_i) \cap M(p \mid S_j))}{\#(M(p \mid S_i) \cup M(p \mid S_j))} \qquad (2)$$



The overall compatibility of a chain $S_i$ is then defined with respect to the set
$S_T$ of chains that touch $S_i$, i.e. $S_T = \{S_j \mid \#(T_{ij}) > 0\}$:

$$w_{\mathrm{inter}}(S_i) = \frac{1}{\#(S_T)} \sum_{S_j \in S_T} C(S_i, S_j) \qquad (3)$$



Using the inter-chain compatibility, we can now modify the original averaging
for the part evidence vectors in (1) to

$$E(p) = \frac{\sum_{S \in S_p} w_{\mathrm{inter}}(S)\, E(S)}{\sum_{S \in S_p} w_{\mathrm{inter}}(S)} \qquad (4)$$

where $S_p$ is defined as in (1).
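
The plain averaging of (1) and its weighted form (4) can be illustrated with a short Python sketch. This is only a minimal illustration, not the authors' implementation: the function name `average_evidence`, the use of plain lists for evidence vectors, and the example numbers are assumptions introduced here.

```python
def average_evidence(chain_evidence, weights=None):
    """Combine chain evidence vectors E(S) into a part evidence vector E(p).

    With weights=None every chain counts equally (plain averaging);
    otherwise each chain is weighted, e.g. by an inter-chain compatibility.
    """
    if weights is None:
        weights = [1.0] * len(chain_evidence)
    total = sum(weights)
    if total == 0:
        raise ValueError("all chain weights are zero")
    dim = len(chain_evidence[0])
    return [
        sum(w * e[c] for w, e in zip(weights, chain_evidence)) / total
        for c in range(dim)
    ]


# Hypothetical evidence vectors (two classes) for three chains starting at p.
# The third chain disagrees with the others, as a boundary-crossing chain might.
chains = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

plain = average_evidence(chains)                      # unweighted mean
weighted = average_evidence(chains, [1.0, 1.0, 0.1])  # third chain down-weighted
```

Down-weighting the disagreeing chain moves the combined evidence towards the class supported by the mutually compatible chains, which is the intended effect of the inter-chain weights.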



Intra-chain Analysis. The intra-chain analysis for detecting boundary-crossing
chains is based on the following idea. If a chain $S_i = \langle p_{i1} p_{i2} \ldots p_{in} \rangle$ does
not cross boundaries of objects then the evidence vectors $E(p_{i1}), E(p_{i2}), \ldots,
E(p_{in})$ computed by (1) are likely to be similar, and dissimilarity of the evidence
vectors suggests that $S_i$ may be a "crossing" chain. The compatibility measure
adopted here involves a measure of the compatibility of the evidence vectors
of the constituent parts with the evidence vector of the chain. This measure is
captured in the following way. For a chain $S_i = \langle p_{i1} p_{i2} \ldots p_{in} \rangle$,

$$w_{\mathrm{intra}}(S_i) = \frac{1}{n} \sum_{k=1}^{n} E(p_{ik}), \qquad (5)$$

where $E(p_{ik})$ refers to the evidence vector of part $p_{ik}$. Initially, this can be found
by averaging the evidence vectors of the chains which begin with part $p_{ik}$.



Relaxation Scheme. Taking together inter- and intra-chain analysis, our
compatibility measure can be used with a relaxation labeling scheme for updating
the part evidence vectors of the following form:

$$E^{(t+1)}(p) = \Phi\left( \frac{1}{Z} \sum_{S \in S_p} w_{\mathrm{inter}}^{(t)}(S)\, w_{\mathrm{intra}}^{(t)}(S) \otimes E(S) \right), \qquad (6)$$

where $\Phi$ is the logistic function, $Z$ is a normalizing factor, and the binary operator
$\otimes$ is defined as a component-wise vector multiplication, $[a\; b]^T \otimes [c\; d]^T = [ac\; bd]^T$.
For a given chain $S_i = \langle p_{i1}, p_{i2}, \ldots, p_{in} \rangle$ of parts, the updating scheme (6)
not only takes into account the compatibility between evidence vectors of all
parts $p_{ik}$ but also the compatibility between the average evidence vectors and
the chain's evidence vector. The updating scheme (6) (together with (2), (3),
and (5)) defines a (possibly fuzzy) inference procedure that can be executed
in parallel for all parts of a scene, and that solves, at the same time, evidence
combination and consistency analysis of rule instantiations as well as the clique
problem.


Fig. 3. A sketch of the overall organization of a spatiotemporal cluster tree. Spatial 
expansions are along the horizontal, temporal expansions along the vertical. The cluster 
tree shown in Fig. 1 is sketched along the top row (where two unary and a (spatial)
binary feature space are shown), temporal dependencies and expansions are shown 
vertically (where again two unary and a (temporal) binary feature space are shown). 
See text for further explanations 



4 Dynamic Patterns and Scenes 

The previous sections introduced our approach to the learning and classification
of spatial patterns. In this section, we sketch a generalization of CRG
into the temporal domain, CRG$_{ST}$, for learning dynamic ("spatiotemporal")
patterns and its application to animated scenes. For a set of spatiotemporal
patterns $P_t = \{p_{1t}, \ldots, p_{nt}\}$, $t = 1, \ldots, T$, we define the following features:
spatial unary features $u(p_{it})$ (e.g. area, brightness, position), spatial binary
features $b(p_{it}, p_{jt})$ (e.g. distance, relative size), temporal changes in unary features
$\Delta u(p_{it}, p_{it'})$ (e.g. velocity, acceleration), and temporal changes in binary features
$\Delta b(p_{it}, p_{jt}, p_{it'}, p_{jt'})$ (e.g. relative velocity). As before, pattern classification is
learned in a supervised learning paradigm, and learning of classification rules
proceeds in the following way (see Fig. 3):

First, the unary features of all parts (of all patterns at all time points), $\{p_{it}\}$,
$i = 1, \ldots, n$, $t = 1, \ldots, T$, are collected into a unary feature space $U$ in which
each point represents a single pattern part at any time point $t = 1, \ldots, T$.
From this unary feature space, cluster tree expansion proceeds in two directions,
in the spatial domain and in the temporal domain. In the spatial domain (along
the horizontal direction in Fig. 3), cluster tree generation proceeds exactly as
described in Section 2. Each of these feature spaces can now also be expanded in
the temporal domain by recursively analyzing temporal changes in unary ($\Delta u$)
and binary ($\Delta b$) attributes within a limited temporal window. Rule expansion
and refinement proceeds along the same lines as discussed in Section 2 and P, 
leading to a set of rules for classifying spatiotemporal fragments, i.e. for pattern 
fragments and their changes within a restricted temporal window.

Abstract. This paper shows how the apparatus of robust statistics can
be used to extract consistent estimates of surface orientation using
shape-from-texture. We make initial estimates of surface orientation by
measuring the affine distortion of neighbouring spectral peaks. We show how
the initial orientation estimates can be refined using a process of robust
smoothing and subsequently used for reliable curvature estimation. We
apply the method to a variety of real-world and synthetic imagery. Here
it is demonstrated to provide useful estimates of curvature.



1 Introduction 



The recovery of surface shape using texture information is a process that is
grounded in psychophysics. Moreover, it has been identified by Marr as being
a potentially useful component of the 2½-D sketch. Stated succinctly, the problem
is as follows: given a two-dimensional image of a textured surface, how can the
three-dimensional shape of the viewed object be recovered? There are
two contributions that deserve special attention. The first of these is the work
of Gårding, who has developed an elegant differential framework for
shape-from-texture. The contribution here is to link the differential geometry of curved
surfaces to variations in texture gradient. However, the practical realisation of
the method has been confined to the use of artificial structural texture primitives.
The second noteworthy contribution is that of Rosenholtz and Malik. This
is a frequency domain approach. The aim is to recover local surface orientation
parameters which minimise a back-projection error which measures the residual
texture gradient. However, the method relies on numerical optimisation and is
only demonstrated on rather artificial imagery.

One of the criticisms that can be levelled at existing shape-from-texture
methods is their failure to deliver information of sufficient acuity for reliable surface
analysis. In a recent paper, we have addressed the analogous problem of
extracting useful topographic information from shape-from-shading. Here we have
shown how the apparatus of robust statistics can be used to refine an initially
noisy field of surface normal estimates to recover reliable information concerning
surface topography. The aim in this paper is to return to the shape-from-texture
problem and show how the initial estimates of surface orientation delivered by
local spectral analysis can be refined and used for subsequent curvature analysis.
It must be stressed that recovery of shape from texture is more challenging than
the recovery of shape from shading since the process is not constrained by the
physics of light.

The recovery of a dense field of tangent plane orientations is a two-step pro- 
cess. The first step is to make an initial estimate of the local surface orientation. 
Here we use the eigenvectors of the affine distortion matrix for corresponding 
spectral peaks to estimate the slant and tilt directions for local tangent planes. 
Once the initial surface normal estimates are to hand, the second step is to im- 
prove the consistency of the orientation field through the use of local contextual 
information. Conventionally, this second step is realised by smoothing the esti- 
mated directions of the surface normals through a process of local averaging. 
Here we adopt a more elaborate smoothing method which has proved succes- 
sful in the shape-from-shading domain. We use robust error kernels rather than 
quadratic smoothness penalties to improve the organisation of the needle map. 
This allows us to preserve fine surface detail whilst removing the effects of local 
noise. We experiment with the new shape-from-texture method on demanding real
world images of curved textured surfaces. Here the method produces qualitatively
good results.



2 Geometric Modelling 



We commence by reviewing the projective geometry for the perspective trans- 
formation of points on a plane. Specifically, we are interested in the perspective 
transformation between the object-centred co-ordinates of the points on the tex- 
ture surface and the viewer-centred co-ordinates of the corresponding points 
on the image plane. To be more formal, suppose that the texture surface is a
distance $h$ from the camera, which has focal length $f < 0$. Consider two
corresponding points. The point with co-ordinates $X_t = (x_t, y_t, z_t)^T$ lies on the
texture surface while the corresponding point on the image plane has co-ordinates
$X_i = (x_i, y_i, f)^T$. We represent the orientation of the viewed texture surface in
the image plane co-ordinate system using the slant $\sigma$ and tilt $\tau$ angles. For a
given plane, the slant is the angle between the viewer's line of sight and the normal
vector of the plane. The tilt is the angle of rotation of the normal vector to the
texture plane around the line of sight axis. Furthermore, since we regard the
texture as being "painted" on the texture surface, the texture height $z_t$ is always
equal to zero. With these ingredients the perspective transformation between
the texture-surface and image co-ordinate systems is given in matrix form by

$$\begin{pmatrix} x_i \\ y_i \\ f \end{pmatrix} = \frac{f}{h - x_t \sin\sigma} \left\{ \begin{pmatrix} \cos\sigma\cos\tau & -\sin\tau & \sin\sigma\cos\tau \\ \cos\sigma\sin\tau & \cos\tau & \sin\sigma\sin\tau \\ -\sin\sigma & 0 & \cos\sigma \end{pmatrix} \begin{pmatrix} x_t \\ y_t \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ h \end{pmatrix} \right\} \qquad (1)$$



The first term inside the curly braces represents the rotation of the local
texture plane in slant and tilt. The second term represents the displacement
of the rotated plane along the optic axis. Finally, the multiplying term outside
the braces represents the non-linear foreshortening in the slant direction. When
expressed in this way, $z_i$ is always equal to $f$ since the image is formed at the
focal plane of the camera. This transformation can be represented using the
shorthand $(x_i, y_i)^T = T_p (x_t, y_t)^T$, where $T_p$ is a $2 \times 2$ transformation matrix.

Unfortunately, the non-linear nature of the perspective transformation makes
the Fourier domain analysis of the texture somewhat intractable. To overcome
this difficulty it is usual to use a linear approximation of the perspective
projection. To proceed we follow the bulk of the literature on shape-from-texture
and use a locally affine approximation to the perspective transformation. Let
$X_{0t} = (x_{0t}, y_{0t}, h)^T$ be the location of the origin or expansion point for the local
co-ordinate system of the affine transformation. This origin projects to the point
$(x_{0i}, y_{0i}, f)$ on the image plane. We denote the local coordinate system on the
image plane by $X'_i = (x'_i, y'_i, f)$, where $x_i = x'_i + x_{0i}$ and $y_i = y'_i + y_{0i}$. The affine
approximation is given by $T_A(X_{0i}) = J(X_{0i}) = J(T_p X_t)|_{(X'_i = 0)}$, where $J(X_i)$ is
the Jacobian matrix of $X_i$. Rewriting $T_A$ in terms of the slant and tilt angles we
have






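$$T_A = \frac{\Omega}{h f \cos\sigma} \begin{pmatrix} x_{0i}\sin\sigma + f\cos\tau\cos\sigma & -f\sin\tau \\ y_{0i}\sin\sigma + f\sin\tau\cos\sigma & f\cos\tau \end{pmatrix} \qquad (2)$$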



where $\Omega = f\cos\sigma + \sin\sigma\,(x_{0i}\cos\tau + y_{0i}\sin\tau)$. Hence, the affine transformation
matrix $T_A$ depends only on the expansion point $(x_{0i}, y_{0i})$, which is a constant,
together with the slant and tilt angles, which are the goal of our analysis.
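
As a sanity check on the affine approximation, the following Python sketch compares a closed-form $T_A$ with a numerical Jacobian of the perspective projection. It assumes that Equation (2) reads $T_A = \frac{\Omega}{h f \cos\sigma}$ times the $2 \times 2$ matrix with rows $(x_{0i}\sin\sigma + f\cos\tau\cos\sigma,\; -f\sin\tau)$ and $(y_{0i}\sin\sigma + f\sin\tau\cos\sigma,\; f\cos\tau)$; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math


def project(xt, yt, slant, tilt, f, h):
    """Perspective projection of a texture-plane point (z_t = 0) onto the image plane."""
    d = h - xt * math.sin(slant)  # foreshortening denominator
    xi = f * (xt * math.cos(slant) * math.cos(tilt) - yt * math.sin(tilt)) / d
    yi = f * (xt * math.cos(slant) * math.sin(tilt) + yt * math.cos(tilt)) / d
    return xi, yi


def affine_matrix(x0, y0, slant, tilt, f, h):
    """Closed-form local affine approximation at the image point (x0, y0)."""
    s, c = math.sin(slant), math.cos(slant)
    st, ct = math.sin(tilt), math.cos(tilt)
    omega = f * c + s * (x0 * ct + y0 * st)
    k = omega / (h * f * c)
    return [[k * (x0 * s + f * ct * c), k * (-f * st)],
            [k * (y0 * s + f * st * c), k * (f * ct)]]


def numerical_jacobian(xt, yt, slant, tilt, f, h, eps=1e-6):
    """Central-difference Jacobian of the projection with respect to (x_t, y_t)."""
    xp, yp = project(xt + eps, yt, slant, tilt, f, h)
    xm, ym = project(xt - eps, yt, slant, tilt, f, h)
    xq, yq = project(xt, yt + eps, slant, tilt, f, h)
    xr, yr = project(xt, yt - eps, slant, tilt, f, h)
    return [[(xp - xm) / (2 * eps), (xq - xr) / (2 * eps)],
            [(yp - ym) / (2 * eps), (yq - yr) / (2 * eps)]]


# Hypothetical geometry: slant 0.5 rad, tilt 0.8 rad, f = -1, h = 5.
slant, tilt, f, h = 0.5, 0.8, -1.0, 5.0
xt, yt = 0.3, -0.2
x0, y0 = project(xt, yt, slant, tilt, f, h)

closed_form = affine_matrix(x0, y0, slant, tilt, f, h)
numeric = numerical_jacobian(xt, yt, slant, tilt, f, h)
max_diff = max(abs(closed_form[r][c] - numeric[r][c])
               for r in range(2) for c in range(2))
```

The two matrices should agree up to finite-difference error, which reflects the fact that the closed form depends on the texture-plane point only through its image position $(x_{0i}, y_{0i})$.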

We now turn our attention to how the frequency content of the local texture 
plane transforms under local affine geometry. Our starting point is a well known 
property which relates the effect of an affine transformation in the spatial domain 
to the Fourier-domain representation of a signal. Suppose that $G(\cdot)$ represents
the Fourier transform of a signal. Furthermore, let $X$ be a vector of spatial
co-ordinates and let $U$ be the corresponding vector of frequencies. According to
Bracewell et al., the distribution of image-plane frequencies $U_t$ resulting from
the Fourier transform of an affine transformation with matrix $A$ is given by

$$G(A X_t) \longleftrightarrow \frac{1}{|\det(A)|}\, G\left[(A^T)^{-1} U_t\right] \qquad (3)$$

Applying the Fourier property of Equation (3) to the linearised version of the
perspective transformation, the relationship between the texture plane and image
spectra is

$$U_i = (T_A(X)^{-1})^T U_t \qquad (4)$$



Here, we will consider only the affine distortion in the positions of frequency 
peaks. In other words we will not consider the distribution of the energy ampli- 
tude or phase in our analysis. For practical purposes we will use the local power 
spectrum to locate the positions of spectral peaks. 



3 Spectral Distortion across the Image Plane 

In this section we show how to make initial estimates of surface orientation 
using the results presented in the previous Section. To commence, we consider
the point $S$ on the curved texture surface. Suppose that the neighbourhood of
this point can be approximated by a local planar patch. This planar patch
undergoes perspective projection onto the image plane. Using the result presented in
Section 2 we make a locally affine approximation to this perspective projection.
Further suppose we sample the texture projection of the local planar patch at
two neighbouring points $A$ and $B$ lying on the image plane. The co-ordinates
of the two points are respectively $X_a = (x, y)^T$ and $X_b = (x + \Delta x, y + \Delta y)^T$,
where $\Delta x$ and $\Delta y$ are the image-plane displacements between the two points.

Suppose that the local planar patch on the texture surface has a spectral
peak with frequency vector $U_s = (u_s, v_s)^T$. On the image plane, the
corresponding frequency vectors for the spectral peaks at the points $X_a$ and $X_b$
are respectively $U_a = (u_a, v_a)^T$ and $U_b = (u_b, v_b)^T$. Using the Fourier domain
affine projection property presented in Section 2, the texture-surface peak
frequencies are related to the image plane peak frequencies via the equations
$U_a = (T_A(X_a)^{-1})^T U_s$ and $U_b = (T_A(X_b)^{-1})^T U_s$, where $T_A(X_a)$ is the local
affine approximation to the perspective projection of the planar surface patch
at the point $A$ and $T_A(X_b)$ is the corresponding affine projection matrix at the
point $B$. As a result, the frequency vectors for the two corresponding spectral
peaks on the image-plane are related to one another via the affine distortion
$U_b = (T_A(X_a) T_A(X_b)^{-1})^T U_a$. The texture-surface spectral distortion
matrix $\Phi = (T_A(X_a) T_A(X_b)^{-1})^T$ is a $2 \times 2$ matrix. This matrix relates the
affine distortion of the image plane frequency vectors to the 3-D orientation
parameters of the local planar patch on the surface. Substituting for the affine
approximation to the perspective transformation from Equation (2), the required
matrix is given in terms of the slant and tilt angles as

$$\Phi = \frac{\Omega(A)}{\Omega^2(B)} \begin{pmatrix} \Omega(A) + \Delta y\sin\sigma\sin\tau & -\Delta y\sin\sigma\cos\tau \\ -\Delta x\sin\sigma\sin\tau & \Omega(A) + \Delta x\sin\sigma\cos\tau \end{pmatrix} \qquad (5)$$

where $\Omega(A) = f\cos\sigma + \sin\sigma\,(x\cos\tau + y\sin\tau)$ and
$\Omega(B) = f\cos\sigma + \sin\sigma\,((x + \Delta x)\cos\tau + (y + \Delta y)\sin\tau)$.
The above matrix represents the linear mapping governing the spectral
distortion over the image plane. It accounts for distortion of the spectrum
sampled at the location $B$ with respect to the sample
at the location $A$. Next, we show how to solve directly for the parameters of
surface orientation, i.e. the slant and tilt angles, using the eigen-structure of the
transformation matrix $\Phi$.

Let us consider the eigenvector equation for the affine distortion matrix $\Phi$, i.e.
$\Phi w(\Lambda) = \Lambda w(\Lambda)$, where $\Lambda = (\lambda_1, \lambda_2)$ are the eigenvalues of the transformation
$\Phi$ and $w(\Lambda)$ are the corresponding eigenvectors. We can directly determine the
tilt angle from the direction of the eigenvector associated with the eigenvalue
$\lambda_1$. It can be shown that the direction of the leading eigenvector can be used to
estimate the tilt direction using the relation

$$\tau = \arctan\left(\frac{w_y(\lambda_1)}{w_x(\lambda_1)}\right) \qquad (6)$$

Once the tilt angle has been obtained, the slant angle can be recovered using
the second eigenvalue via the relationship

$$\sigma = \arctan\left(\frac{f(\lambda_2 - 1)}{(y(1 - \lambda_2) - \lambda_2\Delta y)\sin\tau + (x(1 - \lambda_2) - \lambda_2\Delta x)\cos\tau}\right) \qquad (7)$$



The first step in orientation recovery is to estimate the affine distortion matrix
which represents the transformation between different local texture regions on
the image plane. These image texture regions are assumed to belong to a single
local planar patch on the curved texture surface. Suppose that $U_1 = (u_1, v_1)^T$
represents a spectral peak estimated at the point with co-ordinates $(x_1, y_1)$ on
the image plane. Further, suppose that $U_2 = (u_2, v_2)^T$ is the corresponding
spectral peak at the point $(x_2, y_2)$. Under the affine model presented in Section
3, the two peaks are related via the equation $U_2 = \Phi U_1$. Consequently, the local
estimate of the affine spectral distortion matrix is $\Phi = (U_1^T)^{-1} U_2$. We only
make use of the most energetic peaks appearing in the power spectrum. That is
to say we do not consider the detailed distribution of frequencies. Our method
requires that we supply correspondences between spectral peaks so that the affine
distortion matrices can be estimated. We use the energy amplitude of the peaks
to establish the required correspondences. We order the peaks according to their
energy amplitude. The ordering of the amplitudes of peaks at different image
locations determines the required spectral correspondence. After estimating the
affine transform between two local spectral peaks we can directly apply the
eigenvector analysis described above to estimate the tilt and the slant angles.
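
The eigen-analysis can be tried out on synthetic data with the Python sketch below. It builds a distortion matrix from known slant and tilt angles and then recovers them from the eigen-structure. Several points are assumptions of this sketch rather than statements from the text: the matrix is taken to have the form $\frac{\Omega(A)}{\Omega^2(B)}$ times the matrix with rows $(\Omega(A) + \Delta y\sin\sigma\sin\tau,\; -\Delta y\sin\sigma\cos\tau)$ and $(-\Delta x\sin\sigma\sin\tau,\; \Omega(A) + \Delta x\sin\sigma\cos\tau)$; the eigenvalue $\lambda_2$ is identified as the one whose eigenvector is $(\Delta y, -\Delta x)$, which follows from the rank-one structure of that matrix; and all function names and numerical values are hypothetical.

```python
import math


def distortion_matrix(x, y, dx, dy, slant, tilt, f):
    """Synthetic spectral distortion matrix for a planar patch (assumed form)."""
    s, c = math.sin(slant), math.cos(slant)
    st, ct = math.sin(tilt), math.cos(tilt)
    omega_a = f * c + s * (x * ct + y * st)
    omega_b = f * c + s * ((x + dx) * ct + (y + dy) * st)
    k = omega_a / omega_b ** 2
    return [[k * (omega_a + dy * s * st), k * (-dy * s * ct)],
            [k * (-dx * s * st), k * (omega_a + dx * s * ct)]]


def recover_orientation(phi, x, y, dx, dy, f):
    """Recover (slant, tilt) from the eigen-structure of a 2x2 distortion matrix."""
    tr = phi[0][0] + phi[1][1]
    det = phi[0][0] * phi[1][1] - phi[0][1] * phi[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    eigenvalues = [(tr + disc) / 2.0, (tr - disc) / 2.0]

    # lambda_2 is taken as the eigenvalue closest to the Rayleigh quotient
    # of the vector (dy, -dx), which is orthogonal to the displacement.
    v = (dy, -dx)
    pv = (phi[0][0] * v[0] + phi[0][1] * v[1],
          phi[1][0] * v[0] + phi[1][1] * v[1])
    rayleigh = (pv[0] * v[0] + pv[1] * v[1]) / (v[0] * v[0] + v[1] * v[1])
    lam2 = min(eigenvalues, key=lambda e: abs(e - rayleigh))
    lam1 = tr - lam2

    # Eigenvector of lambda_1 from the first row of (phi - lambda_1 I) w = 0,
    # falling back to the second row if the first is numerically zero.
    w = (phi[0][1], lam1 - phi[0][0])
    if abs(w[0]) + abs(w[1]) < 1e-12:
        w = (lam1 - phi[1][1], phi[1][0])

    # The eigenvector sign is arbitrary, so fold the tilt into (-pi/2, pi/2].
    tilt = math.atan2(w[1], w[0])
    if tilt > math.pi / 2:
        tilt -= math.pi
    elif tilt <= -math.pi / 2:
        tilt += math.pi

    # Slant from the second eigenvalue; den is zero only in degenerate cases
    # (e.g. displacement orthogonal to the tilt direction), not handled here.
    den = ((y * (1.0 - lam2) - lam2 * dy) * math.sin(tilt)
           + (x * (1.0 - lam2) - lam2 * dx) * math.cos(tilt))
    slant = math.atan(f * (lam2 - 1.0) / den)
    return slant, tilt


# Hypothetical ground truth and sampling geometry.
true_slant, true_tilt, f = 0.4, 0.6, -1.0
x, y, dx, dy = 0.2, 0.1, 0.05, 0.02

phi = distortion_matrix(x, y, dx, dy, true_slant, true_tilt, f)
est_slant, est_tilt = recover_orientation(phi, x, y, dx, dy, f)
```

On noise-free input the recovered angles match the ground truth to numerical precision. With estimated matrices from real spectra the Rayleigh-quotient selection is only approximate, and the tilt is recovered modulo $\pi$ because the sign of an eigenvector is arbitrary.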



4 Robust Smoothing of the Needle-Map 

The orientation estimates returned by the new shape-from-texture method are 
likely to be noisy and inconsistent when viewed from the perspective of local 
smoothness. In order to improve the consistency of our needle map, and hence 
the surface shape description, we employ an iterative smoothing process to up- 
date the estimated normal vectors. However, in order to avoid the over-smoothing 
of local surface detail associated with high curvature features, we use a robust 
smoothing method. Rather than using a quadratic penalty, the error function 
uses robust error kernels, to gauge the effect of the smoothness error. The reason 
for this is that the quadratic penalty grows indefinitely with increasing smoothn- 
ess error. This can have the undesirable effect of over-smoothing genuine surface 
detail. Examples of such surface structures include ridge and ravine structures. 
By contrast, robust error kernels moderate the effects of smoothing over regions 
of genuine surface detail and allow a more faithful topographic representation 
to be recovered 0. 

We choose to use the robust smoothness penalty

$$I = \iint \left\{ \rho_\sigma\left(\left\|\frac{\partial \mathbf{n}}{\partial x}\right\|\right) + \rho_\sigma\left(\left\|\frac{\partial \mathbf{n}}{\partial y}\right\|\right) \right\} dx\, dy \qquad (8)$$

In the above measure, $\rho_\sigma(\eta)$ is the robust error kernel used to gauge the local
consistency of the needle-map or field of surface normals. The argument of the
kernel $\eta$ is the measured error and the parameter $\sigma$ controls the width of the
kernel. It is important to note that the robust-error kernels are applied separately to
the magnitudes of the derivatives of the needle-map in the x and y directions.
Applying variational calculus, the update equation for the surface normals which
minimises the smoothness penalty is given as Equation (9), in which $\mathbf{n}^{(k)}_{i,j}$ is
the estimated surface normal at the pixel with row index $i$ and
column index $j$ at iteration $k$ of the smoothing process.

As stated in Equation (9), the smoothing process is entirely general. Any
robust error kernel $\rho_\sigma(\eta)$ can be inserted into the above result to yield a needle-map
smoothing process. However, it must be stressed that performance is critically
determined by the choice of error-kernel. We have found the most effective error
kernel to be the log-cosh sigmoidal-derivative M-estimator. The kernel has the
functional form

$$\rho_\sigma(\eta) = \frac{\sigma}{\pi} \log\cosh\left(\frac{\pi\eta}{\sigma}\right) \qquad (10)$$
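
To make the behaviour of the log-cosh kernel concrete, the following Python sketch evaluates it together with its derivative. It assumes the kernel has the form $\rho_\sigma(\eta) = \frac{\sigma}{\pi}\log\cosh(\pi\eta/\sigma)$, whose derivative with respect to $\eta$ is $\tanh(\pi\eta/\sigma)$; the function names and sample values are hypothetical.

```python
import math


def log_cosh_kernel(eta, sigma):
    """rho_sigma(eta) = (sigma / pi) * log(cosh(pi * eta / sigma))."""
    z = abs(math.pi * eta / sigma)
    # log(cosh(z)) = z + log(1 + exp(-2z)) - log(2); this form avoids overflow.
    return (sigma / math.pi) * (z + math.log1p(math.exp(-2.0 * z)) - math.log(2.0))


def influence(eta, sigma):
    """Derivative of the kernel with respect to eta: a sigmoidal (tanh) function."""
    return math.tanh(math.pi * eta / sigma)


def quadratic_penalty(eta):
    """Conventional quadratic smoothness penalty, for comparison."""
    return 0.5 * eta * eta


sigma = 1.0
small_error = log_cosh_kernel(0.01, sigma)   # behaves like pi * eta^2 / (2 * sigma)
large_error = log_cosh_kernel(10.0, sigma)   # grows only linearly in |eta|
```

For small errors the kernel is approximately quadratic, while for large errors it grows linearly and its derivative saturates. This bounded influence is what moderates smoothing across genuine surface detail such as ridges and ravines.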



5 Curvature Estimation 

Once the smoothed needle-map is to hand, then we can use the surface normals 
to estimate curvature. In our experiments, we have investigated the quality of 
the shape-index of Koenderink and van Doorn as a scale-invariant measure of
surface topography.

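The differential structure of a surface is captured by the Hessian matrix,
which may be written in terms of surface normals as

$$H = \begin{pmatrix} \left(\frac{\partial \mathbf{n}}{\partial x}\right)_x & \left(\frac{\partial \mathbf{n}}{\partial x}\right)_y \\ \left(\frac{\partial \mathbf{n}}{\partial y}\right)_x & \left(\frac{\partial \mathbf{n}}{\partial y}\right)_y \end{pmatrix} \qquad (11)$$

where $(\cdots)_x$ and $(\cdots)_y$ denote the x and y components of the parenthesised
vector respectively. The eigenvalues of the Hessian matrix, found by solving the
equation $|H - \kappa I| = 0$, are the principal curvatures of the surface, denoted $\kappa_{1,2}$.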
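The shape index is defined in terms of the principal curvatures as

$$\phi = \frac{2}{\pi} \arctan\frac{\kappa_2 + \kappa_1}{\kappa_2 - \kappa_1}, \qquad \kappa_1 \geq \kappa_2 \qquad (12)$$

The magnitude of the curvature is measured by the curvedness $c = \sqrt{(\kappa_1^2 + \kappa_2^2)/2}$.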
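
The curvature measures can be computed from a $2 \times 2$ Hessian as in the Python sketch below. It assumes the shape index $\phi = \frac{2}{\pi}\arctan\frac{\kappa_2 + \kappa_1}{\kappa_2 - \kappa_1}$ with $\kappa_1 \geq \kappa_2$ and the standard Koenderink and van Doorn curvedness $\sqrt{(\kappa_1^2 + \kappa_2^2)/2}$. Averaging the off-diagonal entries of a Hessian estimated from noisy normals is a choice made for this sketch, and the function names are hypothetical.

```python
import math


def principal_curvatures(hessian):
    """Eigenvalues (k1, k2) of a 2x2 Hessian, ordered so that k1 >= k2."""
    a, d = hessian[0][0], hessian[1][1]
    # Assumption of this sketch: symmetrise by averaging the off-diagonal entries.
    b = 0.5 * (hessian[0][1] + hessian[1][0])
    half_tr = 0.5 * (a + d)
    det = a * d - b * b
    disc = math.sqrt(max(half_tr * half_tr - det, 0.0))
    return half_tr + disc, half_tr - disc


def shape_index(k1, k2):
    """Shape index in [-1, 1]; returns None for a planar point (k1 = k2 = 0)."""
    if k1 == k2:
        if k1 == 0.0:
            return None  # undefined on a plane
        # Limit of the arctan expression as k2 - k1 approaches zero from below.
        return -1.0 if k1 > 0 else 1.0
    return (2.0 / math.pi) * math.atan((k2 + k1) / (k2 - k1))


def curvedness(k1, k2):
    """Magnitude of curvature."""
    return math.sqrt((k1 * k1 + k2 * k2) / 2.0)


k1, k2 = principal_curvatures([[2.0, 0.0], [0.0, 1.0]])
saddle = shape_index(1.0, -1.0)     # symmetric saddle
cylinder = shape_index(1.0, 0.0)    # one curvature vanishes
```

The shape index depends only on the ratio of the principal curvatures, so it is unchanged when both are multiplied by the same positive factor, whereas the curvedness carries the scale.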

6 Experiments with Curved Surfaces 

We have experimented with both synthetic surfaces with known ground truth 
and real-world images. The former are used to assess the accuracy of the method, 
while we use the latter to demonstrate the practical utility of the method. 

In Figure 1 we show scatter plots of the ground-truth and the estimated slant
angle, tilt angle and shape-index for a synthetic curved textured surface. In each
case there is a clear regression line. The parameter of our spectral distortion
method is the distance between the points used to estimate the affine distortion
matrix on the image plane. If this distance is too small then the affine distortion
becomes undetectable. If, on the other hand, the distance is too big then we
sample changes in surface orientation rather than perspective foreshortening. In
Figures 2 and 3, for the smoothed and unsmoothed needle maps respectively, we
show a series of scatter plots of the shape-index returned using different values
of the inter-point distance $r$ (which is listed in the figure caption). The straight
line plotted through the data is the least-squares regression line; listed in the figure
caption are the values of the linear regression coefficient $\mu$ fitted to the data.
The main feature to note from these plots is that in the case of the smoothed
version, the slope of the regression line is closer to unity, i.e. the estimated
shape-index is in better agreement with the ground truth. In Figure 4, for the smoothed
(a) and unsmoothed (b) needle-maps, we plot the linear regression coefficients
extracted from the scatter plots as a function of the interpoint distances. If the
shape-index measurements are unbiased then the linear regression coefficient
should be unity. The main feature to note is that there is a critical value of the
distance which results in a maximum value of the regression coefficient. For the
smoothed needle-maps, the linear regression coefficient is closest to unity (0.97)
when the interpoint distance is $r = 16$ pixels; this represents an improvement
over the initial unsmoothed value of $\mu = 0.51$. For the unsmoothed needle-maps,
the best regression coefficient (0.84) is obtained when $r = 48$ pixels; here the
corresponding smoothed value is $\mu = 0.93$.

Fig. 1. Scatter correlation plots. (a) Slant angle correlation; (b) tilt angle
correlation; (c) shape index correlation.

Fig. 2. Effect of neighborhood radius on shape-index for smoothed needle maps:
(a) $r = 2$, $\mu = -0.16$; (b) $r = 4$, $\mu = 0.41$; (c) $r = 8$, $\mu = 0.80$; (d) $r = 16$, $\mu = 0.97$;
(e) $r = 32$, $\mu = 0.94$; (f) $r = 48$, $\mu = 0.93$; (g) $r = 64$, $\mu = 0.90$; (h) $r = 80$, $\mu = 0.85$.
Here $\mu$ is the linear correlation coefficient and $r$ is the radius between two patches.

Finally, we experiment with real world textured surfaces. We have generated 
the images used in this study by moulding regularly textured sheets into curved 
surfaces. The images used in this study are shown in the first column of Figure 5.
There are two sets of images. The first group has been created by placing
a table-cloth with a rectangular texture pattern on top of surfaces of various
shapes. The second group of images has been created by bending a sheet of
wrapping paper into various tubular shapes.

Fig. 3. Effect of neighborhood radius on shape-index for unsmoothed needle maps:
(a) $r = 2$, $\mu = -0.07$; (b) $r = 4$, $\mu = 0.27$; (c) $r = 8$, $\mu = 0.26$; (d) $r = 16$, $\mu = 0.51$;
(e) $r = 32$, $\mu = 0.79$; (f) $r = 48$, $\mu = 0.84$; (g) $r = 64$, $\mu = 0.80$; (h) $r = 80$, $\mu = 0.72$.
Here $\mu$ is the linear correlation coefficient and $r$ is the radius between two patches.

Fig. 4. Plot of the shape index correlation in terms of the neighborhood radius. (a)
Smoothed needle map; (b) non-smoothed needle map.

Fig. 5. Real curved surfaces. (a) Original image; (b) recovered needle map; (c)
smoothed needle map; (d) shape index map; (e) curvedness map.

The remaining columns of Figure 5, from left to right, show the initial needle-map,
the final smoothed needle-map, the estimated shape-index and the estimated
curvedness. In the case of this real world data, the initial needle maps are
more noisy and disorganised than their synthetic counterparts. However, there 
are clear regions of needle-map consistency. When robust smoothing is applied to 
the initial needle maps, then there is a significant improvement in the directional 
consistency of the needle directions. 

7 Conclusions 

We have presented a new method for improving the directional consistency of lo- 
cal tangent planes to textured surfaces. The method commences by finding affine 
distortion matrices for neighbouring points on the image plane. The eigenvectors
and eigenvalues of the affine distortion matrices can be used to make closed
form estimates of the slant and tilt directions. The initial orientation estimates
are iteratively refined using a robust smoothing technique to produce a needle 
map of improved consistency. 

The method is demonstrated on both synthetic imagery with known ground 
truth and on real-world images of man-made textured surfaces. The method 
proves useful in the analysis of both planar and curved surfaces. Moreover, the 
extracted needle maps can be used to make reliable estimates of surface curvature 
information.

Abstract. Two efficient approximate techniques for measuring
dissimilarities between cyclic patterns are presented. They are inspired by
the quadratic time algorithm proposed by Bunke and Bühler. The first
technique completes pseudoalignments built by the Bunke and Bühler
algorithm (BBA), obtaining full alignments between cyclic patterns. The
edit cost of the minimum-cost alignment is given as an upper-bound
estimation of the exact cyclic edit distance, which results in a more accurate
bound than the lower one obtained by BBA. The second technique uses
both bounds to compute a weighted average, achieving even more
accurate solutions. Weights come from minimizing the sum of squared relative
errors with respect to exact distance values on a training set of string
pairs. Experiments were conducted on both artificial and real data, to
demonstrate the capabilities of the new techniques in both accuracy and
quadratic computing time.

Keywords: Cyclic patterns, cyclic strings, approximate string matching, 
structural pattern analysis, 2D shape recognition. 



1 Introduction 

The problem of evaluating a measure of similarity between symbol strings arises
in numerous applications, and has become a fundamental issue of structural
pattern analysis. A number of similarity or distance measures have been
proposed, many of them being special cases or generalizations of the Levenshtein
metric. A commonly used generalization of this metric is the minimum cost
of transforming a string $x$ into a string $y$, when the allowable edit operations are
symbol insertion, deletion and substitution, with costs that are functions of the
involved symbol(s). By analogy with DNA sequences, this transformation
procedure is also known as alignment. The optimal alignment between
$x$ and $y$, defined as the minimum-cost edit sequence to transform $x$ into $y$, is
efficiently computed in quadratic time by a dynamic programming technique
proposed, among others, by Wagner and Fischer.

* This work has been supported by a grant funded by the Agencia Española de
Cooperación Internacional and the European ESPRIT project 30268 EUTRANS.

There are specific applications in which some string $x$ is conveniently
considered as the representative of a class of strings composed of all circular shifts
of $x$. These strings are named cyclic strings. An optimal alignment between two
cyclic strings $x$ and $y$ is defined as one having the minimum cost of transforming
some fixed shift of $x$ into any circular rotation of $y$. This brute-force approach
takes $O(|x| \cdot |y|^2)$ time, where $|\cdot|$ represents the string length. Maes
proposed an $O(|x| \cdot |y| \cdot \log |y|)$ algorithm that efficiently uses dynamic programming
properties. Another approach to cyclic string matching has a theoretical cubic
time complexity, but its practical execution time may be significantly lower.
On the other hand, by sacrificing the strict optimality of
the solution, another way to efficiently deal with this problem arises. Bunke and
Bühler introduced an $O(|x| \cdot |y|)$ suboptimal method which was successfully
used for 2D shape recognition.

In this paper two new approximate techniques are developed based on the
Bunke and Bühler algorithm (BBA). They are compared with BBA and the Maes
algorithm with respect to both solution optimality and algorithm execution time.
Results show that the estimates of the new methods are much more accurate
than those provided by BBA, closely approaching the exact distance
values given by the computationally more intensive Maes algorithm.

2 Foundations 

Let $\Sigma$ be an alphabet and let $\Sigma^*$ be the set of all finite-length 
strings over $\Sigma$. Let $\varepsilon$ denote the empty symbol. An edit operation 
is an ordered pair $(x, y) \ne (\varepsilon, \varepsilon)$ of strings of lengths less 
than 2, denoted by $x \to y$. For all $x_i$ and $y_j$ in $\Sigma$, the function 
$\gamma$ assigns nonnegative real-valued costs to the following edit operations: 
substitute operation, $\gamma(x_i \to y_j) \ge 0$ (it is also known as match if 
$x_i = y_j$); delete operation, $\gamma(x_i \to \varepsilon) \ge 0$; and insert 
operation, $\gamma(\varepsilon \to y_j) \ge 0$. 

The function $\gamma$ can be extended to a sequence of edit operations 
$E = e_1 e_2 \ldots e_k$ by defining the cost of a sequence as 
$\gamma(E) = \sum_{i=1}^{k} \gamma(e_i)$. The edit distance $\delta$ between 
strings x and y is then expressed as 

$$\delta(x, y) = \min\{\gamma(E) \mid E \text{ is an edit sequence transforming } x \text{ into } y\}. \tag{1}$$

This computation is performed by working on a dynamic programming matrix 
called edit graph, defined by $(|x| + 1)$ rows and $(|y| + 1)$ columns, with 
time and space complexity in $O(|x| \cdot |y|)$. 
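
As an illustration of the edit graph computation behind (1), a minimal Python sketch is given below. It is not the implementation used in this work (which was written in C); the names `unit_cost` and `edit_distance` are hypothetical, and the default cost function is an assumption (0 for a match, 1 for any other operation, with `None` standing for the empty symbol). 

```python
def unit_cost(a, b):
    """Assumed cost function gamma: 0 for a match, 1 for any other operation.

    The empty symbol is represented by None, so unit_cost(a, None) is a
    deletion and unit_cost(None, b) is an insertion.
    """
    return 0 if a == b else 1


def edit_distance(x, y, gamma=unit_cost):
    """Fill the (|x|+1) x (|y|+1) edit graph and return its last cell."""
    rows, cols = len(x) + 1, len(y) + 1
    E = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                      # first column: deletions only
        E[i][0] = E[i - 1][0] + gamma(x[i - 1], None)
    for j in range(1, cols):                      # first row: insertions only
        E[0][j] = E[0][j - 1] + gamma(None, y[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = min(
                E[i - 1][j] + gamma(x[i - 1], None),          # delete x_i
                E[i - 1][j - 1] + gamma(x[i - 1], y[j - 1]),  # substitute/match
                E[i][j - 1] + gamma(None, y[j - 1]),          # insert y_j
            )
    return E[-1][-1]
```

Both the time and the space used by this sketch grow with the product of the two string lengths, in agreement with the complexity stated above. 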

3 Cyclic Alignments 

A cyclic shift $\sigma : \Sigma^* \to \Sigma^*$ is defined by 
$\sigma(x_1 x_2 \cdots x_{|x|}) = x_2 x_3 \cdots x_{|x|} x_1$. Let $\sigma^k$ be the 
composition of k cyclic shifts. A cyclic string is an equivalence class, 
denoted by $[x]$, defined by an equivalence relation on $\Sigma^*$: 
$x' \equiv x \Leftrightarrow x' = \sigma^k(x)$, for some $k \in \mathbb{N}$. Given 
two cyclic strings $[x]$ and $[y]$, the cyclic distance between them is defined 
using $\delta(\cdot)$ as 
$\delta_C([x],[y]) = \min\{\delta(\sigma^k(x), \sigma^l(y)) \mid k, l \in \mathbb{N}\}$. 
The following lemma shows a more efficient way to compute $\delta_C([x],[y])$. 

Lemma 1. For each representative x of a cyclic string $[x]$ and for each cyclic 
string $[y]$, $\delta_C([x],[y]) = \delta_C(x,[y])$, where 
$\delta_C(x,[y]) = \min\{\delta(x, \sigma^l(y)) \mid l \in \mathbb{N}\}$. 

The above lemma implies that the brute-force approach to compute the cyclic 
edit distance between $[x]$ and $[y]$ has $O(|x| \cdot |y|^2)$ time complexity. 
The problem we deal with in this paper is, given two cyclic strings $[x]$ and 
$[y]$, to approximate $\delta_C(x,[y])$ as well as possible at a quadratic 
computing cost. 
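
The brute-force computation suggested by Lemma 1 can be sketched as follows. This is only an illustrative Python sketch under the assumption of unit costs; the function names are hypothetical. It keeps x fixed and evaluates the ordinary edit distance against every rotation of y. 

```python
def edit_distance(x, y):
    """Unit-cost edit distance, keeping only two rows of the edit graph."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,             # delete a
                           prev[j - 1] + (a != b),  # substitute or match
                           cur[j - 1] + 1))         # insert b
        prev = cur
    return prev[-1]


def cyclic_distance_brute_force(x, y):
    """Exact cyclic distance: x is fixed, every rotation of y is tried."""
    if not y:
        return len(x)
    return min(edit_distance(x, y[l:] + y[:l]) for l in range(len(y)))
```

Each of the $|y|$ rotations costs one quadratic edit distance computation, which gives the $O(|x| \cdot |y|^2)$ behaviour mentioned above. 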

4 The Approximate Algorithm of Bunke and Bühler: A 
Non Length-Preserving Approach 

Let $y^2 = y_1 y_2 \cdots y_{|y|} y_1 y_2 \cdots y_{|y|}$ be the concatenation of the 
string y with itself. The Bunke and Bühler Algorithm (BBA) is based on the 
following lemma, which comes from the fact that the set of substrings of length 
$|y|$ in $y^2$ is equal to the equivalence class $[y]$. 

Lemma 2. Let x and y be two strings. The distance $\delta_C(x,[y])$ can be computed 
as $\delta_{|y|}(x, y^2)$, where $\delta_{|y|}(x, y^2)$ is the edit distance between x 
and its most similar substring in $y^2$ of length $|y|$. 

BBA produces an estimation of $\delta_{|y|}(x, y^2)$ by searching in the edit graph E 
defined by x and $y^2$ the minimum weighted path from any of the starting nodes 
$(0,0), (0,1), \ldots, (0,|y|)$ to any of the final nodes 
$(|x|,|y|+1), (|x|,|y|+2), \ldots, (|x|, 2 \cdot |y|)$ without any control over the 
length of the paths. This leads to the following computation: 

$$
\begin{aligned}
E(0,j) &= 0, && \forall j,\ 0 \le j \le |y| \\
E(0,j) &= E(0,j-1) + \gamma(\varepsilon \to y_{j-|y|}), && \forall j,\ |y|+1 \le j \le 2 \cdot |y| \\
E(i,0) &= E(i-1,0) + \gamma(x_i \to \varepsilon), && \forall i,\ 1 \le i \le |x| \\
E(i,j) &= \min \begin{cases}
E(i-1,j) + \gamma(x_i \to \varepsilon) \\
E(i-1,j-1) + \gamma(x_i \to y_j) \\
E(i,j-1) + \gamma(\varepsilon \to y_j)
\end{cases} && \forall i,\ 1 \le i \le |x|;\ \forall j,\ 1 \le j \le 2 \cdot |y|
\end{aligned}
\tag{2}
$$

where, for $j > |y|$, $y_j$ stands for the j-th symbol of $y^2$, i.e. $y_{j-|y|}$. 

The BBA estimation $\delta_B(x,y)$ of $\delta_C(x,[y])$ is the smallest cost value among 
the set of values computed at final nodes. BBA has $O(|x| \cdot |y|)$ time complexity. 
In the rest of the paper, $\delta_C(x,[y])$ will be denoted as $\delta_C(x,y)$. 

Figure 1 illustrates the two possible edit graphs to approximately compute, 
by BBA, the cyclic edit distance between the strings ba and abab. Both graphs 
show the suboptimality of this algorithm, giving the approximate values 0 and 
1 respectively, which are lower than the exact distance value 2. 

Fig. 1. Two possible edit graphs to approximately compute, by BBA, the cyclic 
distance between x = ba and y = abab: a) graph associated to x and $y^2$; b) graph 
associated to y and $x^2$. Minimum-cost alignments are marked by sequences of arrows. 
The estimation given by $\delta_B(y,x)$ (1) in graph b), where the longest string was 
vertically placed, is more accurate than the one obtained by $\delta_B(x,y)$ (0). The 
exact cyclic distance value is 2, because x = ba matches a prefix of the rotated 
version baba of the original y = abab, and two further symbols need to be inserted. 
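
The recurrence (2) can be sketched in a few lines of Python. This is an illustrative sketch only, assuming unit costs (0 for a match, 1 for any other operation) and a non-empty y; the function name `bba` is hypothetical and the code is not the C implementation used in the experiments. 

```python
def bba(x, y):
    """Lower-bound estimate of the cyclic distance following recurrence (2).

    Assumes unit costs and a non-empty string y.
    """
    n, m = len(x), len(y)
    yy = y + y                                    # the doubled string y^2
    # Row 0: free start at columns 0..|y|, then insertions up to 2|y|.
    row = [0] * (m + 1) + list(range(1, m + 1))
    for i in range(1, n + 1):
        new = [row[0] + 1]                        # column 0: delete x_i
        for j in range(1, 2 * m + 1):
            new.append(min(
                row[j] + 1,                               # delete x_i
                row[j - 1] + (x[i - 1] != yy[j - 1]),     # substitute/match
                new[j - 1] + 1,                           # insert y_j
            ))
        row = new
    # Final nodes are (|x|, |y|+1), ..., (|x|, 2|y|).
    return min(row[m + 1:])
```

With the strings of Fig. 1, `bba("ba", "abab")` and `bba("abab", "ba")` reproduce the two estimations 0 and 1 discussed above. 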



It was shown in previous work that, given two strings x and y, BBA 
computes $\delta_B(x,y) = \delta(x,y')$, where $y'$ is a substring of $y^2$ which is 
most similar to x, i.e. 
$\delta(x,y') = \min\{\delta(x,z) \mid z \text{ is a substring of } y^2\}$. The 
corresponding editing path will be called pseudoalignment. The following lemma is 
straightforward from the above discussion. 

Lemma 3. The BBA estimation is a lower bound of the exact cyclic distance: 
$\delta_B(x,y) \le \delta_C(x,y)$. 

Significant differences between the estimations given by $\delta_B(x,y)$ and 
$\delta_B(y,x)$ emerge. Let us assume that $|x| < |y|$. When $\delta_B(x,y)$ is 
computed, $y'$ is the substring of $y^2$ most similar to x in contents and length. 
Therefore, the length of $y'$ may tend to be closer to $|x|$ and farther from $|y|$. 
In this way, the quality of the estimation may inversely depend on the difference 
$|y| - |x|$. On the other hand, in the $\delta_B(y,x)$ computation, there is no clear 
trend in the values taken by $|x'|$ ($x'$ is the substring of $x^2$ most similar to 
y), except a closer approximation to both $|y|$ and $|x|$. Consequently, the 
estimation given by $\delta_B(y,x)$ is expected to be more accurate than the one 
obtained by $\delta_B(x,y)$ (see Fig. 1). Exhaustive experiments on both synthetic 
and real data clearly confirmed this expectation. 

5 Extending BBA Pseudoalignments to Full Alignments 

In this section, an extension of pseudoalignments built by BBA is proposed to 
create complete alignments. First, a simple mechanism is proposed to know, 
along with the $\delta_B(x,y)$ computation, the length of the longest substring 
$y'_j$ of $y^2$ that reaches the final node $(|x|, j)$, $0 \le j \le 2 \cdot |y|$, 
among all possible substrings associated with minimum edition paths to this 
particular node. If $|y'_j|$ is known, the subset of $|y| - |y'_j|$ symbols not 
aligned with x, which are the final components of a certain $\bar{y} \in [y]$ for 
which $y'_j$ is a prefix, can be easily identified. Then, by inserting such 
unaligned symbols at the end of the partial edit sequence, a complete alignment 
between x and $\bar{y}$ is constructed. 

Fig. 2. Two possible edit graphs to approximately compute, by EBBA, the cyclic 
distance between x = ba and y = abab: a) graph associated to x and $y^2$; b) graph 
associated to y and $x^2$. Minimum-cost alignments to each final node are marked by 
sequences of arrows. The lower dashed rows contain the number of symbols not aligned 
with the vertical string by each minimum-cost alignment, and the upper dashed rows 
list the costs of the EBBA alignments. In graph b), round-cap arrows represent 
situations in which alignments are pruned to avoid paths longer than $|x|$. Both 
edit graphs lead to the exact cyclic distance value 2. 

In the example of Fig. 2a), the string x = ba was aligned with the substring 
$y'_5$ = ba via the partial edit sequence $E' = \{b \to b, a \to a\}$. Since 
$|y'_5| = 2$ and $|y| = 4$, it can be concluded that two symbols from y (symbols b 
and a) have not been aligned with x. Then, by inserting these unaligned symbols 
into $E'$, a complete alignment is built, leading to 
$E = \{b \to b, a \to a, \varepsilon \to b, \varepsilon \to a\}$. 

The estimation $\delta_{EB}(x,y)$ given by this extended version of BBA (EBBA) 
takes the same $O(|x| \cdot |y|)$ time as BBA. Figure 2 shows the actions performed 
by EBBA on the two edit graphs presented in Fig. 1. In both situations, the 
approximation given by EBBA is the exact cyclic distance between the strings 
involved. 

The length of each substring of $y^2$ aligned with x can be computed when 
the edit graph corresponding to $\delta_{EB}(x,y)$ is being built using (2). In 
fact, only the starting node is actually needed, and the required computation can 
be easily performed along with the standard minimization. This mechanism is also 
used to keep only alignments between x and substrings of $y^2$ whose sizes do not 
exceed $|y|$, by checking the length of each partial winner path prior to extending 
it (see Fig. 2b). 

Lemma 4. The EBBA estimation is an upper bound of the exact cyclic distance: 
$\delta_C(x,y) \le \delta_{EB}(x,y)$. 

Proof. EBBA builds full alignments between x and some members of $[y]$, and 
$\delta_C(x,y)$ is the cost of the minimum-cost alignment between x and any 
$\bar{y} \in [y]$. 
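
One plausible reading of the EBBA mechanism can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the authors' implementation: it assumes unit costs and a non-empty y, stores for each cell the starting column of its winner path, breaks cost ties in favour of the longest aligned substring, and prunes extensions that would align x with more than $|y|$ symbols. The function name `ebba` is hypothetical. 

```python
def ebba(x, y):
    """Upper-bound estimate of the cyclic distance (sketch of the EBBA idea).

    Assumes unit costs and a non-empty string y.
    """
    n, m = len(x), len(y)
    yy = y + y
    # Row 0 of recurrence (2), plus the starting column of each path.
    cost = [0] * (m + 1) + list(range(1, m + 1))
    start = list(range(m + 1)) + [m] * m
    for i in range(1, n + 1):
        new_cost, new_start = [cost[0] + 1], [0]
        for j in range(1, 2 * m + 1):
            # Deleting x_i never lengthens the aligned substring.
            candidates = [(cost[j] + 1, start[j])]
            if j - start[j - 1] <= m:             # diagonal step allowed
                candidates.append(
                    (cost[j - 1] + (x[i - 1] != yy[j - 1]), start[j - 1]))
            if j - new_start[j - 1] <= m:         # horizontal step allowed
                candidates.append((new_cost[j - 1] + 1, new_start[j - 1]))
            # Minimum cost first; ties go to the smallest start column,
            # i.e. to the longest aligned substring.
            c, s = min(candidates)
            new_cost.append(c)
            new_start.append(s)
        cost, start = new_cost, new_start
    # Complete each pseudoalignment by inserting the |y| - (j - start)
    # symbols of the rotation that were left unaligned.
    return min(cost[j] + (m - (j - start[j])) for j in range(m + 1, 2 * m + 1))
```

For the strings of Fig. 2, both `ebba("ba", "abab")` and `ebba("abab", "ba")` give 2, the exact cyclic distance. Since every completed path is a full alignment between x and a rotation of y, the returned value cannot fall below the exact distance, in line with Lemma 4. 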



6 Using BBA and EBBA as Bounds to Construct More 
Approximate Solutions 

Lemmas 3 and 4 give lower and upper bounds of the exact solution to the problem 
of computing the distance between cyclic strings, i.e., 
$\delta_B(x,y) \le \delta_C(x,y) \le \delta_{EB}(x,y)$. In this section a new 
approximate solution $\delta_W(x,y)$ (Weighted BBA) is proposed, as a weighted 
average between the lower bound $\delta_B(x,y)$ and the upper bound 
$\delta_{EB}(x,y)$. Coefficients (weights) can be estimated by minimizing 
the sum of squared relative errors of the weighted solutions with respect to the 
exact distances. They are computed by using a training set T of string pairs 
of the problem at hand. Since both bounds can be computed simultaneously in 
$O(|x| \cdot |y|)$ time, the combined solution is also computed in 
$O(|x| \cdot |y|)$ time. A formalization is given below: 

$$\delta_W(x,y) = \alpha \cdot \delta_B(x,y) + (1 - \alpha) \cdot \delta_{EB}(x,y). \tag{3}$$

The approximation error is: 

$$E_W = \sum_{\forall (x,y) \in T} \left[ \frac{\delta_W(x,y) - \delta_C(x,y)}{\delta_C(x,y)} \right]^2. \tag{4}$$

The error is minimized for a value of $\alpha$ such that $dE_W / d\alpha = 0$: 

$$\alpha = \frac{\displaystyle\sum_{\forall (x,y) \in T} \frac{\delta_{EB}(x,y) - \delta_B(x,y)}{\delta_C(x,y)} \cdot \left[ \frac{\delta_{EB}(x,y)}{\delta_C(x,y)} - 1 \right]}{\displaystyle\sum_{\forall (x,y) \in T} \left[ \frac{\delta_{EB}(x,y) - \delta_B(x,y)}{\delta_C(x,y)} \right]^2}. \tag{5}$$
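
The estimation of the weight can be sketched in Python as follows. The sketch assumes that the three distances of every training pair are already available as a tuple (lower bound, upper bound, exact distance) and that every exact distance is positive; the names `optimal_alpha` and `webba` are hypothetical. The closed form follows from setting the derivative of the sum of squared relative errors with respect to the weight to zero. 

```python
def optimal_alpha(training):
    """Least-squares weight for a set of (d_b, d_eb, d_c) triples.

    d_b is the BBA lower bound, d_eb the EBBA upper bound and d_c the exact
    cyclic distance (assumed to be strictly positive).
    """
    num = 0.0
    den = 0.0
    for d_b, d_eb, d_c in training:
        spread = (d_eb - d_b) / d_c           # relative gap between the bounds
        num += spread * (d_eb / d_c - 1.0)    # relative excess of the upper bound
        den += spread ** 2
    # If both bounds always coincide, any weight gives the same estimate.
    return num / den if den > 0 else 0.0


def webba(d_b, d_eb, alpha):
    """Weighted average of the lower and upper bounds."""
    return alpha * d_b + (1.0 - alpha) * d_eb
```

When the exact distance lies halfway between both bounds for every pair, the sketch returns 0.5; when the upper bound is always exact, it returns 0, which gives the whole weight to EBBA. 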



7 Experimental Results 

A number of experiments were conducted on simulated as well as on real data. 
The first series (Sect. 7.1) compares the quality of the solutions yielded by the 
Bunke and Bühler algorithm (BBA), the extended BBA (EBBA) and the weighted 
version of BBA (WeBBA), measured as the Average Relative Error (ARE) 
with respect to exact distance values. It is defined as 
$\mathrm{ARE} = \sqrt{E_A / P}$, where $E_A$, $A \in \{B, EB, W\}$, is computed as 
in (4) for the three suboptimal techniques, and P is the number of pairs of the 
test set. A comparison among their computing times and the time needed by the 
Maes algorithm (MA) is also included. In Sect. 7.2 a real classification problem 
is considered, where distance estimation techniques were used as dissimilarity 
measures between cyclic patterns. In all experiments, only the ordering of the 
input string pair in which the longest string is vertically placed in the edit 
graph is considered (see Sect. 4). Exhaustive experiments not presented in this 
paper show that, with this ordering of the input strings, BBA achieves a 
significantly better estimation of the exact solution; consequently, it gives a 
more precise criterion of how much better the proposed techniques (EBBA and 
WeBBA) are with respect to BBA. The value assigned to each positive-cost edit 
operation is 1 for all experiments. All the algorithms were implemented in the C 
programming language, and experiments were performed on a 166 MHz Intel Pentium 
MMX with 128 MB of RAM. 

Fig. 3. Comparison among BBA, EBBA and WeBBA for each set of synthetic strings 
regarding: a) the Average Relative Error (ARE) on test sets, given as percentages; 
b) the Average Computing Time (ACT) in seconds needed by suboptimal techniques 
and by the Maes algorithm, to approximate and to exactly compute a cyclic distance, 
respectively. 
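
The ARE measure used throughout this section can be sketched in Python as follows. The function name is hypothetical and the sketch assumes that all exact distances are positive; it takes the square root of the mean squared relative error over the P test pairs. 

```python
import math


def average_relative_error(estimates, exact):
    """Square root of the mean squared relative error over all pairs."""
    total = sum(((e - c) / c) ** 2 for e, c in zip(estimates, exact))
    return math.sqrt(total / len(exact))
```

For instance, two estimations that are each 10% away from the exact distances yield an ARE of 0.1, i.e. 10%. 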

7.1 A Comparison among BBA, EBBA and WeBBA 

Experiments on Synthetic Strings. Six pairs of training and test sets, with 
70 randomly drawn strings in each one, were generated from a uniform 
distribution law. The symbols of the strings were chosen from an alphabet composed 
of six symbols. To guarantee different average lengths of strings among pairs of 
sets, individual lengths were also randomly drawn following a uniform 
distribution law in the ranges [5,15], [15,25], [30,50], [65,95], [135,185] and 
[280,360] respectively. From the 70 strings of each set, a set of 2415 different 
non-ordered string pairs was built. For each of the six training sets, a 
corresponding optimal value of $\alpha$ was computed using (5). 
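
The construction of one synthetic set and of its string pairs can be sketched in Python as follows. The concrete symbols of the alphabet, the function names and the use of Python's `random` module are assumptions made for illustration; only the set size, the alphabet size and the length ranges come from the description above. 

```python
import random
from itertools import combinations

ALPHABET = "abcdef"  # six symbols; the actual symbols are an assumption
LENGTH_RANGES = [(5, 15), (15, 25), (30, 50), (65, 95), (135, 185), (280, 360)]


def random_string_set(length_range, size=70, rng=random):
    """Draw `size` strings with uniform symbols and uniform lengths."""
    low, high = length_range
    return [
        "".join(rng.choice(ALPHABET) for _ in range(rng.randint(low, high)))
        for _ in range(size)
    ]


def unordered_pairs(strings):
    """All non-ordered pairs of distinct set members."""
    return list(combinations(strings, 2))
```

A set of 70 strings gives 70 · 69 / 2 = 2415 non-ordered pairs, which matches the figure quoted above. 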

Results are presented in Fig. 3 in terms of the approximation error (ARE) 
on test sets and the Average Computing Time (ACT) of estimating a cyclic 
distance as a function of the average string length. 

A substantial error reduction is achieved by WeBBA with respect to 
previous approaches, mainly for longer strings. On the other hand, the computing 
times of all suboptimal methods are similar and are far smaller than 
that of the exact Maes algorithm. 

The optimal values of $\alpha$ computed from each training set were 0.431, 0.316, 
0.316, 0.276, 0.339 and 0.368 respectively. All these values were lower than 0.5, 
confirming that EBBA makes notably better estimations than BBA, since $\alpha$ is 
the weight given to the BBA bound in (3). In this way, the value of $\alpha$ 
represents, for each training set, a reliable index of the relationship between 
the accuracies of BBA and EBBA: the better the approximations of EBBA with 
respect to BBA, the lower the value of $\alpha$. On the other hand, the 
estimations given by these suboptimal techniques are data-dependent with respect 
to, for example, the lengths of the involved strings or the size of the alphabet 
of symbols. In this particular example with synthetic data, different values of 
$\alpha$ were obtained from each training set. This is due to the fact that the 
average lengths of the strings in the sets are notably different, which leads to 
different relations between the accuracies of the estimations given by BBA and 
EBBA, respectively. 

Fig. 4. Chicken piece images from different classes. 

Experiments with Chain-Code Representations from Silhouettes of 
Chicken Pieces. The previous experiment was repeated with a set of chain-code 
contours describing silhouettes of chicken parts. A set composed of 446 
images of chicken pieces was used. Each piece belongs to one of five categories, 
which represent specific parts of the chicken: wing (117 samples), back 
(76), drumstick (96), thigh and back (61), and breast (96). Each image is in 
binary format and contains the silhouette of a particular piece. Pieces were 
placed in a natural way without considering orientation. All images were clipped 
and scaled to 64x64 pixels (see examples in Fig. 4). A standard 
4-direction chain-encoding procedure [2] was applied, and the resulting 
chain-code contours were re-encoded into rotation-invariant representations in 
which the new codes specify the relative change of angle as a function of line 
length. 
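
The idea behind a rotation-invariant re-encoding can be illustrated with the classical first difference of a chain code. The Python sketch below is not the encoding used in these experiments, which additionally takes line length into account; it only shows, under that simplifying assumption, why coding relative changes of direction removes the dependence on the orientation of the piece. The function name is hypothetical. 

```python
def differential_chain_code(codes, directions=4):
    """First difference of a closed chain code, taken modulo `directions`.

    Each output symbol is the turn between two consecutive contour steps;
    the first step is compared with the last one because the contour is
    closed.
    """
    return [(codes[i] - codes[i - 1]) % directions for i in range(len(codes))]
```

Rotating a shape by a multiple of 90 degrees adds a constant to every 4-direction code, and this constant cancels in the differences. The starting point of the contour still shifts the resulting string cyclically, which is why cyclic string matching is needed. 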

From these 446 chain-code strings, two sets of 100 (one for training and 
one for test) were randomly picked, preserving the frequencies among different 
classes. From each string set, an associated set composed of 4950 different 
non-ordered string pairs was built. The training set of pairs was used to compute 
the optimal value for $\alpha$ using (5). Table 1 shows the computed value of 
$\alpha$ and the approximation error (ARE) of BBA, EBBA and WeBBA on both the test 
set and the training set. In this case, WeBBA obtained an ARE reduction of 40.43% 
with respect to EBBA on the test set. 

Table 1. Approximation error (ARE) of BBA, EBBA and WeBBA for 
rotation-invariant chain-code representations of silhouettes of chicken pieces. 

  α       ARE of WeBBA      ARE of WeBBA   ARE of EBBA   ARE of BBA 
          on Training Set   on Test Set    on Test Set   on Test Set 
  0.328   1.73%             2.24%          3.76%         7.88% 



7.2 Classification Experiments with the Chicken Pieces Data Set 

In this section a classification experiment on the data set composed of 
chain-code representations of chicken pieces is presented. The Levenshtein 
non-cyclic edit distance (ED), the BBA, the WeBBA and the Maes procedure were 
used to compute dissimilarities between cyclic patterns. These dissimilarities 
were normalized by the sum of the lengths of the two cyclic patterns involved. 
The classification error rate was estimated for all the techniques through a 
leave-one-out scheme with the 1-Nearest Neighbor classification rule using the 446 
samples. Since WeBBA needs a training set for estimating the value of $\alpha$, 
the data set was partitioned into four subsets, keeping the frequencies among 
classes, for this particular case. From each subset i, $0 \le i \le 3$, a related 
set of all different non-ordered pairs of strings was built. It was used to 
compute the value of the corresponding $\alpha_i$ using (5). To classify a contour 
of subset i, a coefficient computed from a different subset (with the subset index 
taken mod 4) was used. In this way, each cyclic pattern did not take part in the 
computation of the coefficient $\alpha$ used in its classification. 

Table 2 shows the classification error rate and the average time needed to 
make a decision for each technique. With respect to the error rate, WeBBA 
(21.97%) is as accurate as the Maes optimal algorithm (22.65%); at the same time, 
its decision time (9.69 s) is of the same order as those of BBA (6.01 s) and ED 
(2.90 s), and far below that of the Maes algorithm (56.49 s). 
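
The evaluation protocol can be sketched in Python as follows. The sketch is illustrative only: the function name is hypothetical, `distance` stands for any of the string-to-string techniques compared here, and all patterns are assumed to be non-empty. It applies the leave-one-out scheme with the 1-Nearest Neighbor rule and normalizes every dissimilarity by the sum of the lengths of the two patterns. 

```python
def leave_one_out_1nn_error(samples, labels, distance):
    """Leave-one-out error rate of the 1-NN rule with normalized distances."""
    errors = 0
    for i, query in enumerate(samples):
        # Nearest neighbour among all the other samples.
        best_j = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: distance(query, samples[j])
            / (len(query) + len(samples[j])),
        )
        errors += labels[best_j] != labels[i]
    return errors / len(samples)
```

Any dissimilarity can be plugged in as `distance`, for example an exact cyclic distance or one of the approximate estimations, so that the resulting error rates are directly comparable. 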

8 Conclusions and Further Work 

In this paper two efficient techniques based on the Bunke and Bühler algorithm 
(BBA) have been proposed to approximate the cyclic edit distance between two 
strings. They have been compared with BBA and the Maes algorithm (MA) with 
respect to the quality of their estimations and their computing time. 



Table 2. Classification results on chicken pieces data set. 

  String-to-string   Error     Average Time (in seconds) 
  Technique          Rate      to Make a Decision 
  ED                 32.51%     2.90 
  BBA                24.66%     6.01 
  WeBBA              21.97%     9.69 
  Maes               22.65%    56.49 

The first technique (EBBA) extends pseudoalignments built by BBA to complete 
alignments by inserting all the symbols not considered by BBA. The second 
one (WeBBA) consists of a weighted average between the estimations given by 
BBA (lower bound of the exact value) and EBBA (upper bound), to achieve a 
significantly better approximation. Weights are those which minimize the sum of 
the squared relative errors of the weighted solutions for a training set. 

Experiments on both synthetic and real data showed that better approximations 
are achieved by WeBBA and EBBA with respect to BBA, while keeping the same 
quadratic computation time, far from the time cost of MA. Synthetic 
experiments showed that the estimation errors given by suboptimal techniques tend 
to decrease as the lengths of the strings grow, especially for WeBBA. A 
classification experiment was also carried out, in which WeBBA behaved very well 
in both accuracy and computing time with respect to other string-to-string 
techniques. This approximate technique (WeBBA) seems to be a very attractive 
approach to deal with classification tasks based on cyclic string matching, 
because it constitutes a good trade-off between accuracy and computation time. 

Another point worth mentioning concerns the possibility of computing values 
for the coefficient $\alpha$ which optimize criteria other than accuracy. For 
example, if the problem has well-defined classes, different $\alpha$ values could 
be learned for each class according to some criterion related to the 
classification error. To classify a new sample, distances to each class would be 
computed using the value of the parameter for that specific class. 


Abstract. We propose a randomized method for the detection of symmetry 
in planar polygons without assuming the predetermination of the 
centroids of the objects. Using a voting process, which is the main concept 
of the Hough transform in image processing, we transform the geometric 
computation for symmetry detection, which is usually based on graph 
theory and combinatorial optimization, into the peak detection problem 
in a voting space in the context of the Hough transform. 



1 Introduction 

In pattern recognition, the symmetry of an object is an important feature be- 
cause symmetry provides references for the recognition and measurement of ob- 
jects. The symmetry information enables the speeding up of the recognition pro- 
cess and also the reduction of the space required for storage of the object models. 
The symmetry properties of objects yield valuable information for image under- 
standing and compression. For the utilization of the symmetry properties, it is 
necessary to determine the symmetry axes or shape orientations. 

In this paper, we propose a symmetry detection method based on the random 
sampling and voting process. An object is said to be rotationally symmetric 
if the object, after being affected by a transformation, becomes identical to the 
original object. If point set V is an n-fold rotationally symmetric object with 
n > 3, this point set V will be identical to itself after being rotated around 
the centroid through any multiple of $2\pi/n$. The voting process converts direct 
geometric and analytical computation of features from data to the peak detection 
problem in a voting space. The method proposed here is an extension of our 
randomized method for motion detection, which we proposed previously. 
In our previous papers we derived algorithms for both 2D and 3D motion 
estimation without the predetermination of point correspondences. If an object 
is rotationally symmetrical, the result of a rotation which is determined by the 
symmetry yields the same shape; that is, there are ambiguities in the solutions 
obtained from the motion detection algorithm if the object has symmetry axes 
and occlusion is not considered. In this paper, we use these ambiguities for the 
determination of the symmetries of planar polygons. 

Many methods have been proposed to determine the object orientation, such 
as those involving principal axes, reflection-symmetry axes and universal 
principal axes. However, these methods are not suitable when the shape is 
rotationally symmetric. Lin proposed a method for the determination of 
shape orientations by the fold-invariant, introducing the concepts of the 
fold-invariant centroid (FIC) and the fold-invariant weighted mean (FIRWM). In 
this method, the rotational symmetry of a shape is defined as the direction of 
the unique half-line which begins from the centroid and passes through FIC and 
FIRWM. The number of folds n of a given rotationally symmetric shape can be 
determined by a string matching technique. Lin et al. proposed a method for 
the determination of the number of folds based on a simple mathematical property. 
Recently, Lin also proposed a modification of his previous method in which 
the matching procedure is discarded. Additionally, we can find other approaches 
such as the one proposed by Yip et al., who use the Hough transform to 
determine the rotational symmetry of planar shapes. 

The motion analysis algorithm which we proposed previously 
detects motion parameters for planar and spatial motions without any 
assumption on the point correspondences among image frames. Therefore, if an 
object is rotationally symmetrical, this algorithm yields ambiguous 
solutions. Using this fundamental property of the motion analysis algorithm based 
on the random sampling and voting process, we construct a common framework for 
the detection of symmetry of both planar and spatial objects. Our algorithm does 
not require the predetermination of the centroid, since our motion analysis 
algorithm does not require the centroid. Nevertheless, our algorithm detects the 
centroid after detecting the symmetry of an object, for both planar and spatial 
objects. Furthermore, the algorithm detects both rotation symmetry and reflection 
symmetry for planar objects simultaneously. Moreover, the algorithm estimates the 
centroid of polyhedrons from surface information using symmetry. These properties 
are the significant advantages of our new algorithm for the detection of symmetry. 
In this paper, we assume that point set V is a polygonal set on a plane. 
Furthermore, we assume that points on the boundaries of these point sets are 
extracted and sampled with an appropriate method. 



2 Symmetry and Transformation 

Symmetry indicates the congruence of an object under transformations. Here 
we assume Euclidean transformations. The presence of an axis of symmetry in 
an object is considered as the existence of rotational or reflectional symmetry. 
In this paper, we only consider rotation symmetry and the number of axes of 
rotation symmetry. The order of rotation symmetry is called the folding number 
for planar figures. 

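For a set of vectors V in Euclidean space, we define 
$F(V) = \{\mathbf{y} \mid \mathbf{y} = F\mathbf{x}, \forall \mathbf{x} \in V\}$ for 
linear transformation F. Let U be a rotation matrix such that $U^m = I$, 
for an appropriate positive integer m such that m > 2. Setting $\mathbf{g}$ to be 
the centroid of V, we define a set of vectors 
$V_g = \{\bar{\mathbf{x}} \mid \mathbf{x} \in V\}$ for 
$\bar{\mathbf{x}} = \mathbf{x} - \mathbf{g}$. Setting 
$U^k(V_g) = \{\mathbf{y} \mid \mathbf{y} = U^k \bar{\mathbf{x}}, \mathbf{x} \in V\}$, 
if $V_g = U^k(V_g)$, then V has a symmetry axis with respect to $\mathbf{g}$. Then, 
V is n-rotation symmetrical, and we call n the folding number of an object V with 
respect to the axis of rotation k. 

Let the rotation matrix on the two-dimensional Euclidean plane be 

$$U = \begin{pmatrix} \cos\frac{2\pi}{n} & -\sin\frac{2\pi}{n} \\ \sin\frac{2\pi}{n} & \cos\frac{2\pi}{n} \end{pmatrix} \tag{1}$$

for a positive integer n. Setting 
$U^k(V_g) = \{\mathbf{y} \mid \mathbf{y} = U^k \bar{\mathbf{x}}\}$ for 
$k = 1, 2, \ldots, n$, if $U^k(V_g) = V_g$, then V is rotation symmetrical and the 
folding number of V is n. Furthermore, for orthogonal matrix M and rotation matrix 
R such that 

$$M = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \tag{2}$$

setting a set $\mathcal{M}$ to be the result of the application of matrix M and 
rotation $R(\theta)$ to $V_g$, that is, 
$\mathcal{M} = \{\mathbf{y} \mid \mathbf{y} = R(\theta) M \bar{\mathbf{x}}\}$, if the 
equality $\mathcal{M} = V_g$ is satisfied, then point set V is reflectionally 
symmetrical with respect to a line such that 
$\boldsymbol{\omega}^{\perp\top}(\mathbf{x} - \mathbf{g}) = 0$, for 
$\boldsymbol{\omega}^{\perp} = (-\sin\frac{\theta}{2}, \cos\frac{\theta}{2})^\top$, 
which passes through $\mathbf{g}$ and is parallel to vector 
$\boldsymbol{\omega} = (\cos\frac{\theta}{2}, \sin\frac{\theta}{2})^\top$. 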
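
The definition of rotation symmetry about the centroid can be checked directly on a small point set. The Python sketch below is an illustration of the definition only, not of the randomized method proposed in this paper; the function name and the tolerance `tol` are assumptions. 

```python
import math


def is_n_fold_symmetric(points, n, tol=1e-9):
    """True if rotating the centred point set by 2*pi/n maps it onto itself."""
    gx = sum(p[0] for p in points) / len(points)   # centroid g
    gy = sum(p[1] for p in points) / len(points)
    centred = [(px - gx, py - gy) for px, py in points]
    c, s = math.cos(2 * math.pi / n), math.sin(2 * math.pi / n)
    for px, py in centred:
        rx, ry = c * px - s * py, s * px + c * py  # apply the rotation U
        # Every rotated point must coincide with some point of the set.
        if not any(math.hypot(rx - qx, ry - qy) <= tol for qx, qy in centred):
            return False
    return True
```

For a finite point set it is enough to test the single rotation U, because a rotation that maps the set into itself is a bijection on the set and so are all its powers. The folding number is then the largest n for which the test succeeds. 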



3 Motion Analysis by Sampling and Voting 

Setting $\{\mathbf{x}_\alpha\}_{\alpha=1}^{n}$ and $\{\mathbf{y}_\beta\}_{\beta=1}^{n}$ 
to be points on an object on the Euclidean plane $\mathbf{R}^2$, which are observed 
at times $t_1$ and $t_2$, respectively, such that $t_1 < t_2$, we assume that for 
arbitrary pairs of $\alpha$ and $\beta$, $\mathbf{x}_\alpha$ and $\mathbf{y}_\beta$ 
are connected by a Euclidean motion. If we do not know the point correspondences 
between frames, the motion parameters R and $\mathbf{t}$, which are a rotation 
matrix and a translation vector, respectively, are obtained as the solution which 
minimizes the criterion 

$$E = \min_{\sigma, R, \mathbf{t}} |\mathbf{y}_{\sigma(\alpha)} - (R\mathbf{x}_\alpha + \mathbf{t})|, \tag{3}$$

where $\sigma(\alpha)$ is a permutation over $1 \le \alpha \le n$. 

Rotation symmetry and the folding number of an object define point 
correspondences with respect to the rotation axes. Therefore, if we detect point 
correspondences, we can determine symmetry and the folding number with 
respect to an axis of rotation of an object. Since the random sampling and voting 
method for motion analysis detects both motion parameters and point 
correspondences concurrently, we apply this method to the detection of symmetry 
of an object. 

Assuming that $\mathbf{t} = 0$, motion analysis algorithms detect the rotation of 
an object. If an object is rotationally symmetrical, the result of a rotation which 
is determined by the folding number n yields the same shape. Therefore, if we 
apply the motion analysis algorithms to an object which is rotationally 
symmetrical, we can obtain all rotation matrices $U^k$, that is, 

$$\min_{\sigma, R, \mathbf{t}} |\mathbf{y}_{\sigma(\alpha)} - (R\mathbf{x}_\alpha + \mathbf{t})| = \min_{\sigma, R, \mathbf{t}} |\mathbf{y}_{\sigma(\alpha)} - (U^k R\mathbf{x}_\alpha + \mathbf{t})|, \quad k = 1, 2, \ldots, n. \tag{4}$$

Therefore, our algorithm detects all $U^k R$ for $k = 1, 2, \ldots, n$, where R is 
the true rotation matrix. However, all estimated matrices have the same translation 
vector. Using this ambiguity, we detect symmetry axes and the folding numbers 
of spatial objects. After a sufficient number of iterations, this algorithm detects 
all rotation matrices such that $U^k(V) = V$. 

Figure 1 shows two hexagons which are connected by a motion equation. 
However, if we do not know any point correspondences, the motion analysis 
algorithm yields the solutions 

$$R = R(2\pi k/6)\, R(2\pi/12), \quad k = 1, 2, \ldots, 6. \tag{5}$$



Therefore, there exist ambiguities in the solutions for rotationally symmetrical 
objects. Here, the number 6 in $R(2\pi/6)$ is the folding number of the hexagon. 
Therefore, the ambiguity of the solutions yields the folding number of a planar 
object. Furthermore, if a planar shape is reflectionally symmetrical, two shapes 
V and M(V) are the same shape. Therefore points on V satisfy the relation 

min \Va(a)- + = k=l,2,--- ,n. (6) 

a,R.t 

for an appropriate rotation matrix U. This geometric property leads to the 
conclusion that by applying our motion analysis algorithms, it is possible to 
detect the reflection axes if we detect all peaks in the accumulator space. 

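Let $\{\mathbf{x}_i = (x_i, y_i)^\top\}_{i=1}^{n}$ be a set of points which is moving 
on a plane. For $\mathbf{x}_i$, $\mathbf{x}_j$ and $\mathbf{x}_k$, 

$$\mathbf{x}_{ik} = (x_i - x_k,\ y_i - y_k)^\top, \qquad \mathbf{x}_{ik}^{\perp} = (-(y_i - y_k),\ x_i - x_k)^\top, \tag{7}$$

where $\mathbf{x}_{ik}^{\perp}$ is orthogonal to $\mathbf{x}_{ik}$, are invariant 
under a Euclidean motion of a set of points. Furthermore, 

$$\mathbf{u}_{ik} = \frac{\mathbf{x}_{ik}}{|\mathbf{x}_{ik}|}, \qquad \mathbf{u}_{ik}^{\perp} = \frac{\mathbf{x}_{ik}^{\perp}}{|\mathbf{x}_{ik}^{\perp}|} \tag{8}$$

form an orthogonal basis. Moreover, in the same way we define $\mathbf{y}_{ik}$, 
$\mathbf{y}_{ik}^{\perp}$, $\mathbf{v}_{ik}$, $\mathbf{v}_{ik}^{\perp}$ for the second 
image frame. These two sets of orthogonal bases derive the orthogonal expansions 
of vectors 

$$\mathbf{x}_{jk} = \alpha^1_{jk}\mathbf{u}_{ik} + \alpha^2_{jk}\mathbf{u}_{ik}^{\perp}, \qquad \mathbf{y}_{jk} = \beta^1_{jk}\mathbf{v}_{ik} + \beta^2_{jk}\mathbf{v}_{ik}^{\perp}, \tag{9}$$

where 

$$\alpha^1_{jk} = \mathbf{x}_{jk}^\top\mathbf{u}_{ik}, \quad \alpha^2_{jk} = \mathbf{x}_{jk}^\top\mathbf{u}_{ik}^{\perp}, \quad \beta^1_{jk} = \mathbf{y}_{jk}^\top\mathbf{v}_{ik}, \quad \beta^2_{jk} = \mathbf{y}_{jk}^\top\mathbf{v}_{ik}^{\perp}. \tag{10}$$

Setting $\mathbf{x}_\gamma$ and $\mathbf{y}_\gamma$, $\gamma = i, j, k$, to be 
noncollinear triplets of points in the frames 1 and 2, respectively, if these 
triplets are corresponding points which satisfy the same motion equation, the 
equations 

$$\alpha^1_{jk} = \beta^1_{jk}, \qquad \alpha^2_{jk} = \beta^2_{jk}, \qquad |\mathbf{x}_{pq}| = |\mathbf{y}_{pq}|, \tag{11}$$

where $p \ne q$ for $p, q \in \{i, j, k\}$, hold. Moreover, using these corresponding 
triplets, we have 

$$R = (\mathbf{v}_{ik}, \mathbf{v}_{ik}^{\perp})(\mathbf{u}_{ik}, \mathbf{u}_{ik}^{\perp})^\top. \tag{12}$$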
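
The way a rotation is obtained from two corresponding point pairs can be sketched in Python as follows. The sketch assumes that the perpendicular vector is obtained by a counterclockwise quarter turn in both frames and that the two points of each pair are distinct; the function names are hypothetical. 

```python
import math


def unit_frame(p, q):
    """Unit vector along p - q and its counterclockwise perpendicular."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    norm = math.hypot(dx, dy)
    u = (dx / norm, dy / norm)
    u_perp = (-u[1], u[0])
    return u, u_perp


def rotation_from_pairs(xi, xk, yi, yk):
    """R = (v, v_perp)(u, u_perp)^T for corresponding pairs (xi, xk), (yi, yk)."""
    u, up = unit_frame(xi, xk)
    v, vp = unit_frame(yi, yk)
    return (
        (v[0] * u[0] + vp[0] * up[0], v[0] * u[1] + vp[0] * up[1]),
        (v[1] * u[0] + vp[1] * up[0], v[1] * u[1] + vp[1] * up[1]),
    )


def rotation_angle(R):
    """Angle of a 2x2 rotation matrix, in the interval (-pi, pi]."""
    return math.atan2(R[1][0], R[0][0])
```

Because only differences of points enter the computation, the translation cancels out and the recovered angle depends on the rotation alone. 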

Although random sampling finds pairs of triplets of vectors which satisfy the 
relations in eq. (11), these relations do not imply that a pair of triplets is 
connected by a Euclidean transform. Therefore, we solve this problem using the 
voting procedure, since the voting procedure collects much evidence and infers 
the solution. 

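If a pair of triplets $\{\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k\}$ and 
$\{\mathbf{y}_i, \mathbf{y}_j, \mathbf{y}_k\}$ is connected by the relation 

$$\mathbf{y}_{ij} = R(\theta) M \mathbf{x}_{ij}, \tag{13}$$

the coefficients defined in eq. (10) satisfy the relations 

$$|\mathbf{x}_{pq}| = |\mathbf{y}_{pq}|, \tag{14}$$

where $p \ne q$, for $p, q \in \{i, j, k\}$. Therefore, applying the motion analysis 
algorithm to a pair of vector triplets 
$\{\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k\}$ and 
$\{M\mathbf{y}_i, M\mathbf{y}_j, M\mathbf{y}_k\}$, we can detect the direction of 
the line of reflection as $\boldsymbol{\omega}$. 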

As shown in Figure 2, our planar algorithm first determines a triangle in each
frame, and second computes the rotation parameter between these two triangles.
In Figure 2, the algorithm determines the angle between line segments $13$ and
$1'3'$, if we assume that vertices $i$ and $i'$ correspond for $i = 1, 2, 3$. Although there are
many possibilities for the combinations of triangles and point correspondences,
for the hexagons in Figure 2, the number of vertex combinations which determine
the rotation angle

$\theta = \frac{2\pi k}{6}, \quad k = 1, 2, \ldots, 6,$ (15)

is larger than the number of combinations of edges which determine an angle such
that

$\theta \neq \frac{2\pi k}{6}, \quad k = 1, 2, \ldots, 6,$ (16)

for the $2 \times 2$ rotation matrix $R(\theta)$. If we apply motion analysis to an object by
sampling pairs of simplexes, then $\theta = 2\pi k/6$ for $k = 1, 2, \ldots, 6$. Then we can
determine the folding number of planar polygonal objects.
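
The sampling and voting idea described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `vote_rotation_angles`, the congruence test, the 1-degree bin width, and the iteration count are all assumptions made for illustration. Two ordered triplets are drawn at random, the pair is kept only if the triangles are congruent with the same orientation, and the rotation angle between their first edges receives one vote.

```python
import numpy as np

def cross2(a, b):
    """z-component of the cross product of two 2D vectors."""
    return a[0] * b[1] - a[1] * b[0]

def vote_rotation_angles(frame1, frame2, n_iter=5000, n_bins=360, tol=1e-9, seed=0):
    """Accumulate votes for rotation angles between congruent random triplets."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(n_bins, dtype=int)
    for _ in range(n_iter):
        x = frame1[rng.choice(len(frame1), size=3, replace=False)]
        y = frame2[rng.choice(len(frame2), size=3, replace=False)]
        u1, u2 = x[1] - x[0], x[2] - x[0]
        v1, v2 = y[1] - y[0], y[2] - y[0]
        if abs(cross2(u1, u2)) < tol:
            continue  # collinear triplet, no triangle
        # Same edge lengths, same inner product and same signed area:
        # the two triangles are related by a rotation (no reflection).
        congruent = (np.isclose(u1 @ u1, v1 @ v1) and np.isclose(u2 @ u2, v2 @ v2)
                     and np.isclose(u1 @ u2, v1 @ v2)
                     and np.isclose(cross2(u1, u2), cross2(v1, v2)))
        if not congruent:
            continue
        theta = np.arctan2(cross2(u1, v1), u1 @ v1)  # angle from u1 to v1
        acc[int(round(theta / (2 * np.pi) * n_bins)) % n_bins] += 1
    return acc

# A regular hexagon compared with itself: votes fall on multiples of 60 degrees.
angles = 2 * np.pi * np.arange(6) / 6
hexagon = np.column_stack([np.cos(angles), np.sin(angles)])
acc = vote_rotation_angles(hexagon, hexagon)
print(np.flatnonzero(acc))
```

For noisy data the exact congruence test would have to be replaced by a tolerance, and the bin width would have to match the expected angular error.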



4 Detection of Symmetry 

In the following, we assume that the boundary points on an object are extracted
using an appropriate method. As we mentioned in the previous section, our
motion analysis algorithm based on the random sampling and voting process
detects all rotation matrices $R^k$, for $k = 1, 2, \ldots, n$, if the folding number of an
object is $n$, that is, if an object $V$ is $n$-rotationally symmetric.

Fig. 1. Two hexagons related by a Euclidean motion.

Fig. 2. Triangles and point correspondences on two hexagons.

The parameter $\theta_c$, such that $-\pi < \theta_c < \pi$, which is computed by

$R(\theta_c) = (v_{12}, v_{12}^{\perp})(u_{12}, u_{12}^{\perp})^{\top}$ (17)

is an estimate of the rotation angle for a planar object. Therefore, our accumulator
space for the detection of rotation angles is a finite linear array equivalent
to the interval $[-\pi, \pi]$, which is equivalent to $[0, 2\pi]$. Setting $\theta_{\min}$ to be the
smallest positive value which possesses a peak in this accumulator space, the folding
number is $2\pi/\theta_{\min}$. Therefore, the detection of the folding number is achieved
by detecting peaks in the accumulator space. We detect the folding number by applying
cepstrum analysis to the accumulator space, since the peak distribution along the
cells of the accumulator space can be regarded as a periodic function. This property
leads to the following algorithm for the detection of folding numbers of planar shapes.

Algorithm for the detection of the folding number

1 Compute the DFT (the Discrete Fourier Transform) of $\mathrm{score}(\theta)$ using the
FFT (the Fast Fourier Transform).

2 Compute the power spectrum of $\mathrm{score}(\theta)$ from the result of Step 1, and set
it as $S(n)$.

3 Compute the logarithm of the power spectrum $S(n)$ from the result of
Step 2, and set it as $C(n)$.

4 Detect the positive peak of $C(n)$ for the smallest $n$ and set it as $n^*$.

5 Adopt $n^*$ as the folding number.
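
The five steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `folding_number`, the small constant `eps` that keeps the logarithm finite, and the rule "positive and not smaller than both neighbours" for a peak are choices made here. Whether a peak counts as positive depends on how the votes are scaled.

```python
import numpy as np

def folding_number(score, eps=1e-12):
    """Return the smallest n >= 1 at which the log power spectrum has a positive peak."""
    score = np.asarray(score, dtype=float)
    spectrum = np.fft.rfft(score)        # Step 1: DFT of the accumulator via the FFT
    power = np.abs(spectrum) ** 2        # Step 2: power spectrum S(n)
    log_power = np.log(power + eps)      # Step 3: logarithm C(n); eps avoids log(0)
    for n in range(1, len(log_power) - 1):   # Step 4: smallest positive peak
        if (log_power[n] > 0
                and log_power[n] >= log_power[n - 1]
                and log_power[n] >= log_power[n + 1]):
            return n                     # Step 5: adopt n* as the folding number
    return None                          # no positive peak found

# Idealised accumulator with 360 one-degree cells and peaks every 60 degrees.
score = np.zeros(360)
score[::60] = 1.0
print(folding_number(score))
```

A real accumulator contains spurious votes, so a threshold on the peak height would normally replace the simple positivity test used here.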
Once the rotation matrix has been detected, it is possible to determine the
correspondences of points on the boundary of an object for a motion

$y = R(2\pi/n)\,x + t.$ (18)

Therefore, we can determine $t$ and the centroid of an object by

$g = (I - R(2\pi/n))^{-1} t.$ (19)

Therefore, for the detection of the folding number, we need not prepare
vectors whose centroid is the origin of the coordinate system.
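
Eq. (19) follows because a rotation about the centroid $g$ maps $x$ to $R(x - g) + g = Rx + (I - R)g$, so that $t = (I - R)g$. A short numerical check is sketched below; the helper names `rotation` and `centroid_from_motion` and the test values are assumptions for illustration only.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def centroid_from_motion(n, t):
    """Solve (I - R(2*pi/n)) g = t for the centroid g; requires n >= 2."""
    R = rotation(2 * np.pi / n)
    return np.linalg.solve(np.eye(2) - R, t)

# Hypothetical 6-fold object centred at (2, 3): build t from g, then recover g.
g_true = np.array([2.0, 3.0])
t = (np.eye(2) - rotation(2 * np.pi / 6)) @ g_true
g = centroid_from_motion(6, t)
print(g)
```

For $n = 1$ the matrix $I - R$ is singular, which matches the fact that an object without rotational symmetry has no centroid determined by this relation.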

As mentioned in the previous section, if we apply the motion analysis
algorithm to $\{x_i, x_j, x_k\}$ and $\{M y_i, M y_j, M y_k\}$, we can detect the direction
of the line of reflection as $\omega_r$. For the minimum value $\theta_r$ in the accumulator space
which is computed by

$R(\theta_r) = (v_{12}, v_{12}^{\perp})\,(M(u_{12}, u_{12}^{\perp}))^{\top},$ (20)

a reflectionally symmetrical object satisfies the relation $V_g = R(\theta_r)(M(V_g))$.
Therefore, the reflection axis of an object is the line $(\omega_r^{\perp})^{\top}(x - g) = 0$, where
$\omega_r^{\perp}$ is the unit vector perpendicular to $\omega_r$. This line passes through the
centroid of $x_i$ and $y_i$ and is perpendicular to the vector $x_i - y_i$.

For a simple closed polygonal curve on a plane, the folding number is equal
to or less than the number of vertices of this polygon, since the matrix $R(\theta)$
transforms vertices to vertices. Therefore, the folding number satisfies the
relation $2 \le n \le v(n)$, where $v(n)$ is the number of vertices of a polygon. Setting
$\hat{n}$ to be the folding number detected by $R(\theta_r)$, we call $\hat{n}$ the number of folding
for reflection, since $\hat{n}$ determines the number of reflection axes. If an
object is reflectionally symmetrical and rotationally asymmetrical, $\hat{n} = 1$ and
$n = 1$. Therefore, using $n$ and $\hat{n}$, we have the following classification criteria
for planar objects.

1. If $n = 1$ or $n > v(n)$, and $\hat{n} \neq 1$, then an object is asymmetrical.

2. If $n = 1$ or $n > v(n)$, and $\hat{n} = 1$, then an object is reflectionally symmetrical.

3. If $2 \le n \le v(n)$, and $\hat{n} \neq 1$, then an object is rotationally symmetrical and
is not reflectionally symmetrical.

4. If $2 \le n \le v(n)$, and $\hat{n} = n$, then an object is rotationally and reflectionally
symmetrical.

5 Computational Results 

For a planar object $V$, let $k$ and $m$ be the folding number and the total number of
sample points. We assume that the $m$ points are separated into $k$ independent subsets,
each of which contains $n$ sample points. Therefore, $k$, $m$ and $n$ satisfy the relation
$kn = m$. For this object, the total number of combinations for the selection of a
pair of triplets is ${}_mC_3 \times {}_mC_3$. The number of combinations for the selection of
a subset is ${}_kC_2$, and the number of combinations for the selection of triplets of
sample points in each subset is ${}_nC_3$.

For a pair of triplets of sample points, if they are congruent triangles, the
rotation angle between them is one of $2\pi q/k$ for $q = 1, 2, \ldots, k$. This means that the
number of combinations of pairs of triplets of sample points which are related
by angles determined by the folding number is the same as the folding
number. Therefore, setting $p$ to be the probability that a selected triplet is not
collinear, the probability for the correct selection of noncollinear triplets is

$P = p \cdot \frac{1}{k} \cdot \frac{{}_kC_2 \times {}_nC_3}{{}_mC_3 \times {}_mC_3}.$ (21)



In the following, we deal with the case that 0{m) = 0{n) and 0{k) = 10 
Therefore, assuming that fc <C m, we have P~^ = O(m^) . Next, setting N 
and e to be the total number of iteration and the threshold for the detection of 
peaks, respectively, P, N, and e which is e = 0{m^) satisfy the relation NP > e. 
Therefore, we have N > In the following examples, we set N = 10^ 

for m = 50 and s = 4. 

During the computation of $\theta_c$ and $\theta_r$, we also compute the centroid $g$ and the line
$(\omega_r^{\perp})^{\top}(x - g) = 0$ simultaneously. Furthermore, using the imaging plane on
which the original object is expressed as the accumulator space, we vote 1 to the
vector $g$ and to the line for the estimation of the centroid and the reflection axes.
Using this procedure, we can superimpose the centroid and the reflection axes
on the original object.

Figure 3 shows the numerical results of symmetry analysis for planar
objects. Subfigures (a), (b), (c), (d), (e), (f), (g), (h), and (i) show the input data; the
peaks in the accumulator space for the detection of rotation; the cepstrum of the
peak distributions for the rotation; the peaks in the accumulator space for the
detection of reflection; the cepstrum of the peak distributions for the reflection;
the peaks for the detection of the centroids; the peaks for the detection of the
reflection axes; the centroids, the reflection axes, and the input objects; and the
objects reconstructed by folding the input using the folding numbers for rotation.
In these figures, the small square blocks in subfigure (a) show sample points.
For the performance analysis of the algorithm, we assumed that the boundaries
of objects are extracted using an appropriate algorithm and that a finite number of
samples is extracted from the boundaries. Here, the total number of sample points
on the boundary is 54, and the number of iterations is 10^.

A point set V, which is n-rotationally symmetrical, satisfies the relation 



V = U («7“(Vg © M)) , (22) 

a=l 



where $\oplus$ expresses the Minkowski addition of two sets of points. This expression
of the rotationally symmetrical object implies that, if the algorithm is stable
against occlusions, we can reconstruct the complete object from a partially occluded
object by first rotating the object around the estimated centroid by $2\pi k/n$,
$k = 1, 2, \ldots, n$, and second superimposing all of the rotated copies. Figure 3(i) shows
an object reconstructed using this property of rotationally symmetrical objects. This
process is possible because our algorithm is stable against occlusions.
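
The reconstruction step can be sketched in a few lines. The function name `reconstruct_by_folding` and the toy data are assumptions for illustration; the sketch rotates the visible points about a given centroid by $2\pi k/n$ for $k = 1, \ldots, n$ and stacks the copies.

```python
import numpy as np

def reconstruct_by_folding(points, centroid, n):
    """Superimpose n copies of the points, rotated about the centroid by 2*pi*k/n."""
    points = np.asarray(points, dtype=float)
    g = np.asarray(centroid, dtype=float)
    copies = []
    for k in range(1, n + 1):
        theta = 2 * np.pi * k / n
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        copies.append((points - g) @ R.T + g)   # rotate about g, not about the origin
    return np.vstack(copies)

# One surviving corner of a square centred at (1, 1); folding number 4.
visible = np.array([[2.0, 1.0]])
full = reconstruct_by_folding(visible, centroid=(1.0, 1.0), n=4)
print(np.round(full, 6))
```

In practice the centroid and the folding number would come from the voting step, so the quality of the reconstruction depends on the accuracy of both estimates.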





Fig. 3. Numerical examples



6 Conclusions 

In this paper, we developed a randomized algorithm for the detection of rotational
symmetry in planar polygons without assuming that the centroids of the objects are
determined in advance. Our algorithm is simple because we converted the
matching problem for the detection of symmetry into peak detection in a voting
space. This result shows that the voting process is a suitable approach to
simplifying matching problems. The numerical stability of the random sampling and
voting process for motion analysis is discussed in our previous papers.
We also determined the size of the cells in the accumulator spaces for the detection
of motion parameters using this analysis.
Abstract. A model based structural recognition approach is used for 3D
detection and localization of vehicles. It is theoretically founded by syntactic 
pattern recognition using coordinate grammars and depicted by production nets. 
The computational effort significantly depends on certain tolerance parameters 
and the distribution of input data in the attribute domain. A brief theoretical 
survey of these interrelations is accompanied by comparing the performance on 
synthetic random data to the performance on data from different natural 
environments. 



1 Introduction 

In structural computer vision the computational effort often depends on the data.
Investigating such interdependencies is therefore an important issue. For the 3D
detection of man-made objects in images, model knowledge can be represented by
e.g. productions, frames or semantic networks [9]. Utilization of knowledge is
commonly understood as a search for corresponding objects in the data. Bottom-up,
top-down or mixed strategies are used for structural approaches. A* search [10] may
serve as a well known example. Some heuristic evaluation function is used that
assesses the maximal or probable merit of intermediate results with respect to the final
goal of complete model-to-data correspondence. There are tasks that hardly permit the
formulation of such a function.

Vehicle recognition from oblique and very oblique (nearly horizontal) views is an
example of such a task. In contrast to aerial vertical views [6,15], size and aspect are
very variable. Also radiometry and contrasts of the target object and other objects in
the background or foreground are hardly predictable. Some variations are displayed in
Fig. 1. It is difficult to define preferences or exclusions for intensities, contrasts,
positions, directions, sizes etc. We propose a structural approach using a complete
bottom-up part-of analysis. This approach competes with mutual information methods
[4] and some quite similar but probabilistic methods based on generalized cylinders
[1]. Since our approach leads to high computational effort, we propose to use rather
simple, well-scaling methods and structures. Therefore, the assessment of worst-case
and probable efforts and the verification of such assessments on relevant data are a
worthwhile endeavour.
Fig. 1. Ground-based images of different vehicles (VWBUS-PICKUP). a) Scene 1: Object
distance ~20m, sunny, visibility mediocre, b) Scene 2: Object distance ~130m, diffuse, clear 
visibility, c) Scene 3: Object distance ~320m, diffuse, dull visibility 



Computational complexity analysis is common practice in other related pattern
recognition disciplines like e.g. labeling line drawings of polyhedral scenes [11],
geometric hashing [17], or structured methods based on volumetric primitives and
aspect graph matching [2], but has not yet been addressed in our section of
syntactically inspired structural methods.

Section 2 briefly recalls production net definitions, methods and implementations. An
example net is given in section 3. Using this example the effort assessment method is
discussed in section 4 and practical results are given in section 5.



2 Production Nets for Object Recognition 

Most symbolic methods in pattern recognition deal with structures like strings, trees,
arrays or graphs. Production net theory is based on coordinate grammars and thus
simply uses sets [7,8]. The productions work on sets of instances $(s, d)$ consisting of a
symbol $s \in T \cup N$ from a finite set of terminals and non-terminals and a numeric
attribute vector $d \in D$ from a domain which usually contains coordinates, orientations,
surface normals etc. Pairs, triples, etc. of such instances are called configurations.



2.1 Production 

Productions consist of a condition and an action part. The condition part gives a
predicate defined on the input configuration. The action part gives a function
calculating the output configuration (usually a single object). A simple example is
given by

$((\mathrm{LINE}, \mathrm{LINE}), \pi) \xrightarrow{\varphi} (\mathrm{ANGLE})$ (1)

Objects of type LINE have image coordinates and orientations as attributes. The
condition demands a pair (LINE, LINE) fulfilling $\pi$, which defines 'adjacent and
rectangular' with some necessary tolerances. The function $\varphi$ calculates the intersection of
the straight lines corresponding to the input configuration. This coordinate is needed
as attribute value for the new object ANGLE.
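
A production of this kind can be sketched as a condition predicate plus an action function. Everything in the sketch is an assumption made for illustration: the `Line` and `Angle` classes, the representation of a line by its two end points, the tolerances (9 degrees, 10 pixels), and the reading of 'adjacent' as "some pair of end points is close". It is not the BPI implementation referred to in the text.

```python
import math
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Line:
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass(frozen=True)
class Angle:
    x: float
    y: float

def orientation(line):
    """Orientation of a line in degrees, in the range [0, 180)."""
    return math.degrees(math.atan2(line.y2 - line.y1, line.x2 - line.x1)) % 180.0

def condition(a, b, ang_tol=9.0, dist_tol=10.0):
    """Predicate 'adjacent and rectangular' with hypothetical tolerances."""
    diff = abs(orientation(a) - orientation(b))
    diff = min(diff, 180.0 - diff)
    rectangular = abs(diff - 90.0) <= ang_tol
    ends_a = [(a.x1, a.y1), (a.x2, a.y2)]
    ends_b = [(b.x1, b.y1), (b.x2, b.y2)]
    adjacent = any(math.dist(p, q) <= dist_tol for p in ends_a for q in ends_b)
    return rectangular and adjacent

def action(a, b):
    """Intersection of the two infinite straight lines through a and b."""
    d1 = (a.x2 - a.x1, a.y2 - a.y1)
    d2 = (b.x2 - b.x1, b.y2 - b.y1)
    det = d1[0] * d2[1] - d1[1] * d2[0]   # non-zero for near-perpendicular lines
    s = ((b.x1 - a.x1) * d2[1] - (b.y1 - a.y1) * d2[0]) / det
    return Angle(a.x1 + s * d1[0], a.y1 + s * d1[1])

def apply_production(lines):
    """Insert a new ANGLE for every pair fulfilling the condition; inputs are kept."""
    return [action(a, b) for a, b in combinations(lines, 2) if condition(a, b)]

lines = [Line(0, 0, 100, 0), Line(102, 3, 102, 80), Line(500, 500, 600, 500)]
angles = apply_production(lines)
print(angles)
```

The pairwise loop is quadratic in the number of lines; an index on the attribute domain would be needed to avoid testing every pair.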
Generally a production contains left and right words $\Sigma$ and $\Lambda$ of symbols from $T \cup N$
with at least one non-terminal in $\Lambda$. We write $|\Sigma|$ for the length of the word $\Sigma$ and $\{\Sigma\}$
for its corresponding multi-set.



2.2 Production Net 

Production nets display the interaction of several such productions in a system. As 
graphs they resemble Petri nets. The set of nodes is given by the set of symbols 
(object types) and the set of productions. An edge leads from an object type to a 
production, if the condition part contains it. If it is contained multiply, it is drawn 
multiply. An edge leads from a production to an object type, if it produces it. 
Examples for production nets are published in [12,13,14,15]. 



2.3 Model Knowledge for Vehicles 

There are several possibilities for modeling vehicles geometrically. One may e.g. use
articulated 3D models. The projection may also be included in the geometric model,
so that finally 2D views - or linear combinations of these - are matched as in [16].
Such modeling may be used if the camera is directly approaching the target object.
Otherwise stereo methods and 3D matching with articulated models are preferred. For
the statistical discussion in this context we refer to a hierarchically organized
fixed-shape model of a small truck already known from [7,8].



2.4 Implementation 

Our implementation is based on a blackboard shell named BPI [13]. Each production
defines a separate processing module containing a condition test and an action part. All
modules work on a common memory. They insert new instances, but they do not
delete the instances of the input configuration. Thus the system accumulates
instead of replacing. Such irrevocable control facilitates the processing of large data
sets at moderate effort scaling [10]. The accumulation method serves as an
approximation of the semantically correct replacement and backtrack method [7,8].
Associative memory aids the reduction of effort scaling [13].



3 Example Production Net 

As example vehicle we choose a small six seated truck and named the corresponding 
object type VWBUS-PICKUP. Fig. 2 shows the production net designed for the 
recognition of such objects in very oblique image sequences accompanied by informal 
sketches of the meaning of intermediate object types. This has been published before 
in more detail [7,8]. 
Two sub nets may be distinguished according to the dimension of the attribute
domain. The main model-to-data match is implemented in the 3D sub net. The 2D sub
net contains some simple standard productions for line prolongation, corner and
U-structure composition. It is executed on each image separately and linked to the 3D
sub net with the stereo production p4. The extraction of line segments from the images
has been described in [12].

Provided each 3D part required by the net is visible in at least two images (the
sequences used consist of eight frames) a lot of occlusion is tolerable. Invariance of
the detection result is given with respect to a large variety of aspects and distances
and with arbitrary back- and foreground objects. Due to the deep part-of hierarchy and
3D model use false detection is very unlikely. But a high detection rate requires
generous tolerance parameters in the conditions, which becomes computationally
expensive.

Fig. 2. Production Net VWBUS-PICKUP



4 Statistical Effort Assessment



If $s_i$ denote the object types, a standard production like p3 in Fig. 2 may be written as

$((s_1, s_2), \pi) \xrightarrow{\varphi} (s_3)$ (2)

We denote the set of all corresponding input configurations fulfilling $\pi$ as $\Im_p$ and
define the relative volume $V_p$ as the ratio between $|\Im_p|$ and the size of the set of all
possible configurations. The latter is given by the attribute domain and the number of
objects in the input configuration:

$V_p = \frac{|\Im_p|}{|D|^{|\Sigma|}}$ (3)



This gives a measure for the degree of restriction provided by a production. If e.g. $\pi$
is 'parallel' in an orientation domain $O = \{0^\circ, \ldots, 179^\circ\}$ with tolerance $\delta_o = \pm 9^\circ$, then
we get $V_p = 19/180 = 0.106$. Often relative volumes will result from a product, because $\pi$ is
composed as a conjunction of conditions on independent attributes. If e.g. additionally
to 'parallel' also 'adjacent' is required with some tolerance of 10 pixels in the Euclidean
metric in an image of 1M pixel size, we get

$V_p = \frac{19}{180} \cdot \frac{314}{10^6} = 0.000033.$ (4)

Thus small relative volumes result from high dimensional attributes, narrow
tolerances and many independent conditions connected as a conjunction.
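
The arithmetic behind these two relative volumes can be checked directly. The function names below are hypothetical; the numbers (a tolerance of 9 degrees in a 180-value orientation domain, and a disc of radius 10 pixels in an image of one million pixels) are the ones given in the text.

```python
import math

def relative_volume_parallel(tol_deg=9, domain_size=180):
    """Fraction of orientations within +/- tol_deg of a given one: 19 of 180 values."""
    return (2 * tol_deg + 1) / domain_size

def relative_volume_adjacent(radius_px=10, image_pixels=1_000_000):
    """Fraction of the image covered by a disc of the given radius: about 314 of 10^6."""
    return math.pi * radius_px ** 2 / image_pixels

v_parallel = relative_volume_parallel()
v_adjacent = relative_volume_adjacent()
v_both = v_parallel * v_adjacent   # conjunction of independent conditions

print(round(v_parallel, 3), round(v_both, 6))
```

Multiplying the two factors is only justified when the attributes are independent, as stated in the text.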



Provided a random process generates sets of instances $S_1$ and $S_2$ of the object types $s_1$
and $s_2$ with known distribution in $D$, an expectation may be calculated for the number
of instances of $S_3$ produced by $p$ (Eq. 2). Uniformly distributed attribute values in $D$, for
instance, give a Poisson distribution with parameter $\lambda = |S_2| V_p$ for the number of
partners in $S_2$ fulfilling $\pi$ together with a fixed instance of $S_1$ [5]:

$P(\text{No. of Partners} = k) = \frac{1}{k!} \lambda^k e^{-\lambda}.$ (5)

The expectation value for this distribution is $\lambda$. We neglect that in rare cases the same
instance of $S_3$ may result from different input configurations. We assume independence
of the instances of $S_1$ from the instances of $S_2$. Then $|S_1| \lambda$ is an expectation for the
number of instances of $S_3$ resulting from $p$ and we get

$E(|S_3|) = |S_1| \, |S_2| \, V_p.$ (6)

Such equations may be constructed for any production $p_j$ in any net:

$E(|S|) = \prod_i E(|S_i|) \, V_{p_j},$ (7)

where the product runs over the object types in the input configuration of $p_j$ and $S$ is
the set of instances of the object type produced by $p_j$.



Cycle-free production nets provide an order $o(s)$ on the object types, given by the
length of the longest path leading from a terminal to $s$. For such nets the expectation
equations are used with ascending $o$ to calculate all $E(|S|)$ up to the goal type using
the sums

$E(|S|) = \sum_j \prod_i E(|S_i|) \, V_{p_j}$ (8)

and starting with the known distributions for the terminals.

The probable overall demand for memory is then given by the sum of all these
expectations. For the probable computational effort of a full bottom-up search we
have to weight each sum with the computational costs caused by an instance of its
symbol. This is a constant amount mainly consisting of its construction effort plus the
costs for the queries it causes because of the conditions $\pi$ in which it appears. All this
can be calculated in advance.
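
The bottom-up propagation of expectations can be sketched as follows. The function name `expected_set_sizes`, the two-production toy net, and its relative volumes are invented for illustration; the sketch assumes that the productions are already listed in ascending order of the object type they produce.

```python
from math import prod

def expected_set_sizes(terminal_sizes, productions):
    """Propagate expected set sizes through a cycle-free production net.

    productions: list of (input_types, output_type, relative_volume),
    assumed to be sorted so that every input is computed before it is used.
    """
    expected = dict(terminal_sizes)
    for inputs, output, volume in productions:
        contribution = prod(expected[s] for s in inputs) * volume
        # Several productions may produce the same type: their expectations add up.
        expected[output] = expected.get(output, 0.0) + contribution
    return expected

# Hypothetical net: 1000 terminal lines L, pairs LL, then angles A from pairs of LL.
net = [(("L", "L"), "LL", 1e-3), (("LL", "LL"), "A", 1e-4)]
expected = expected_set_sizes({"L": 1000}, net)
memory_demand = sum(expected.values())
print(expected, memory_demand)
```

Weighting each entry of `expected` with a per-instance cost before summing would give the corresponding estimate of the computational effort.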



5 Experiments 

Synthetic random data as well as data from real outdoor images are used to verify the 
relevance and precision of the calculations presented above. Production p6 of the 3D 
sub net has been applied to equally distributed random generated sets of instances O 
with varying sizes and thus densities. The attribute domain here contains four 3D 
coordinates and one surface normal. Table 1 gives the set sizes. 



O    1040    2196    4592    9021    12835    25141    50684    101243
E       2      12      47     227      425     1639     6638     25165

Table 1. 3D Statistics - Random instances O and generated instances E (p6)



The set size of the set of instances E grows quadratic with the set size of the set of 
instances O. Vp6 has been estimated at =10'’ according to the size of the attribute 
Domain (3D coordinates in 500^ and surface normal) and tolerances (±50 in max- 
norm for 3D coordinates and ±Q.3rad ). The data in Table 1 yield a quadratic 
parameter of $2.7 \cdot 10^{-6}$. Such differences result from imprecision in the theoretical
calculations (for instance neglecting special properties at the rim of coordinate spaces
or estimations with linearization of orientation manifolds). We regard $\lambda > 1$ as critical
values, because the desirable monotone decrease of set sizes with the order $o$ will be violated.
An attribute domain of the given size in the example should therefore not contain
more than 370000 instances O.
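
The quadratic parameter and the critical set size can be rechecked from the numbers in Table 1. The variable names are assumptions; the data are copied from the table, and the critical size is taken as the set size at which $\lambda = |O| \cdot 2.7 \cdot 10^{-6}$ reaches 1.

```python
sizes_O = [1040, 2196, 4592, 9021, 12835, 25141, 50684, 101243]
sizes_E = [2, 12, 47, 227, 425, 1639, 6638, 25165]

# Quadratic growth |E| = c * |O|^2: estimate c separately for every column.
ratios = [e / o ** 2 for o, e in zip(sizes_O, sizes_E)]

# lambda = |O| * c exceeds 1 beyond this number of instances.
critical_size = 1 / 2.7e-6

for o, r in zip(sizes_O, ratios):
    print(o, f"{r:.2e}")
print(round(critical_size))
```

The per-column ratios scatter around the quoted value, and the smallest sets deviate most because they contain only a handful of generated instances.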

Fig. 3b-d show natural input data extracted from the images in Fig. 1. The distribution
of instances O resulting from such image sequences is rather non-uniform. Dense clusters
and large nearly empty zones occupy the attribute domain (here 2000^). E.g. in scene
1 (Fig. 3b) 25427 instances E are constructed from 12997 instances O. Consequently
the mean density of instances in the overall domain is of less relevance for the effort
assessment than the density in the clusters (which is much harder to
measure or estimate).
Fig. 3. Instances L from images of Fig. 1 (sections 600x400 Pixel),
a) Data Set Random 2 (1400x700 Pixel), b) Scene 1 (1400x700 Pixel), 
c) Scene 2 (3200x1600 Pixel), d) Scene 3 (6400x3200 Pixel) 



Differences between effort statistics of synthetic random data and real data are less
significant with 2D productions. Columns 1-3 in Tab. 2 confirm the predicted
polynomial growth of set sizes with the polynomial degree depending on O. Natural
data still give different characteristics. Scenes 1 and 2 for instance yield significant
minima at object type A not present in the random data. The system tends to perform
background suppression at this stage (see Fig. 4). As in 3D, mean density is not the
most important feature (columns 3 and 5 are similar in this parameter). A more
important contribution is given by factors such as structure and lighting. Scene 2 for
instance has a lot of man-made straight lines and high contrast rectangles in it
resembling the structure to be detected and thus poses much more of a challenge than the
more blurred and less structured scenes 1 and 3.



Type   Random 1   Random 2   Random 3   Scene 1   Scene 2   Scene 3
L          4318       8598      17185     11299     74253     84952
LL          884       3349      11893      5704     85047     46952
A           359       5168      64677      2455     47368     34284
U            59       4084     185801      6185    154076     52131

Table 2. 2D Statistics - Set Sizes for Object Types L, LL, A, and U
Detection success and failure of the system are not the topic here (they have been
published in [7,8]). In short: correct instances T result from scenes 1 and 2, whereas
scene 3 yields no instances T.





Fig. 4. Instances L, LL, A, and U of data set Random 2 and Scene 1 (sections) 



6 Discussion 

Principally target objects may be modeled from terminal objects using arbitrary 
partial objects. For instance a rectangle may be constructed from four lines using 
angles as well as parallels as intermediate objects. If knowledge about expected 
background structures is given (e.g. major orientations), then the corresponding 
structural relations should be avoided in the low order productions of the net (e.g. 
parallel). Figure 3c shows long straight contours from furrows and right angled 
structures similar to the ones present in the target model. In such cases high 
computational effort on background objects can not be avoided. 

Certainly the terminal objects extracted from images of natural scenes will not be
uniformly distributed. For the terminal object sets displayed in Fig. 3 the distribution of
the attribute orientation is shown in Fig. 5. In ground-based images with man-made
structure vertical and horizontal lines may dominate (Fig. 5b,c). In vehicle detection
tasks the majority of the terminal objects stem from arbitrary structures in the
background or foreground, about which nothing is known. In such situations the equal
density and independence assumptions inherent in the investigations of Sect. 4 are
appropriate. However, if the distribution of an attribute is given, the simple
calculation of the expectation in Eq. 6 will have to be replaced by explicit integration.
For example, for a production constructing parallel pairs of lines a significant peak in
the orientation histogram will raise the expected number of constructed objects.
Fig. 5. Histograms of the attribute orientation (0°-179°) of instances LL.
a)-d) corresponding to Fig. 3 a-d 



Some structural relations allow the assessment of their relative volumes by displaying
corresponding search regions. Fig. 6 shows examples: Adjacency in vector spaces with
a maximum metric and a threshold parameter simply gives an interval (Fig. 6a), a
square (Fig. 6c), or a cube (Fig. 6f). Note that the volume of such regions grows in a
polynomial way with power D (the dimension of the domain). Thus fairly small
changes in the threshold parameter of a 3D structural relation may have severe
consequences on the computational effort. Topologically more complicated are
relations on orientation attributes. The second column shows the examples line
orientation (Fig. 6b), surface orientation (Fig. 6d), and 3D rotation (Fig. 6g). The
exact calculation of the relative volume of the structural relation 'similar in 3D
rotation' with the same threshold in all three angles (Fig. 6g) requires techniques from
differential geometry. At least for small angles volume growth with power D will still
be present. But there are relations where the power of growth will be less than the
dimension. Fig. 6e,h show the search regions corresponding to adjacency of a line. The
size of these regions grows linearly in the 2D case and quadratically in the 3D case.
The length of the rectangle or cuboid is fixed by the length of the line.

Fig. 6. Search regions for important structural relations (rows: 1D, 2D, 3D)



We presented a method for the assessment of the computational effort caused by the 
analysis of images by a production net. Dependencies on tolerance parameters and 
densities of instances in the data become evident. These calculations provide valuable 
quantitative information for the overall system design. Comparisons of the effort 
between the presented systems and other ones are difficult, because they are not 
available. Success and effort also strongly depend on the task, the model and the 
images used. A comparison of the effort and stability of different approaches requires 
re-implementations. Subject of ongoing work is e.g. the implementation of aspect 
based vehicle recognition. 
Abstract. In this paper we propose a new method to relink graph pyramids
by local relinking operations in an iterated parallel way. By representing
graph pyramids as bases of valuated matroids, the goal of the
relinking is expressed by a valuation on the corresponding matroid. This
valuation guides the local relinking operations. The valuation attains its
global maximum if none of the local relinking operations yields higher values.
The new method is used for an adaption of graph pyramids towards
having a given receptive field.



1 Introduction 

To perceive an image is to transform it (SEH^. In order to allow a clear distin- 
ction between transformations of image structure and transformations of image 
contents, we first represent the image as an attributed graph forming the base 
level of a graph pyramid. A common way to construct the base level graph is to 
create a vertex for each pixel and to let the edges represent the 4-connectivity of 
the pixel array. The attributes of the vertices, edges and faces are derived from 
the gray values or colors of the pixels. The other levels of the pyramid are
formed by subsequent dual graph contractions [Kro95a] controlled by application
defined models. A local function, the so called reduction function, derives the
attributes of the current level from the level below. In all levels the attributes 
represent the image contents, while the structure of the image is given by the 
graph without the attributes. The graphs on the higher levels of the pyramid 
yield more and more abstract descriptions of the underlying image. However, 
the construction of the graph pyramid should not be restricted to a bottom-up 
procedure. The alternatives as given by a model usually induce constraints on 
neighborhoods in the graph pyramid. Holding to the separation of structure and 
contents we extend the influence of the model by allowing 

1. relinking of the pyramid without adjusting the contents, 

2. contents adjustments and classification without relinking. 

* This work has been supported by the Austrian Science Fund (FWF) under grant 
S7002-MAT. 
These transformations are also utilized to increase the robustness of the pyramid.
This paper is devoted to efficiently performing the relinking by iterated parallel
transformations (IPT) [Sch97]. A variable linking of regular pyramids was first
described in IfiHksil . An extension to irregular hierarchies of graphs is shown 
in [Nac95]. IPT for contents adjustment and classification, i.e. relaxation, has
been applied to hierarchies of graphs in [wnns]. Since dual graph contraction is 
an IPT towards abstraction, the IPT considered so far can be organized in the 
triangle depicted in Fig. 1.

The paper is organized as follows. Section 2 is devoted to the construction of
graph pyramids by dual graph contraction. In Section 3 we arrive at a definition
of local relinking operations on graph pyramids. The definition is based on the
representation of graph pyramids as bases of matroids. Section 4 introduces
valuations on matroids. The valuations are utilized to guide the local relinking
operations. In Section 5 we apply the relinking to the adaption of graph pyramids
towards having a given receptive field. We conclude in Section 6.



2 Dual Graph Contraction 

The construction of graph pyramids by dual graph contraction (see Fig. 2) is
described in [Kro95a]. Let $G_0 = (V_0, E_0)$ and $\overline{G}_0 = (\overline{V}_0, \overline{E}_0)$ denote a pair of
plane graphs, where $\overline{G}_0$ is the dual of $G_0$. Dual graph contraction consists of two
steps: dual edge contraction and dual face contraction. Dual edge contraction is
specified by a subset $F_0$ of $E_0$, such that the edges of $F_0$ form a spanning forest
of $G_0$. The trees of the spanning forest are referred to as contraction kernels.
In Fig. 2a the non-trivial contraction kernels are emphasized. Each contraction
kernel $T_0$ of $F_0$ is contracted to one vertex $v_1$ of the graph $G_1 = (V_1, E_1)$ on
the next level of the graph pyramid. For each vertex $v_0$ of $T_0$ the vertex $v_1$ is
called the parent of $v_0$ and $v_0$ is called the child of $v_1$. Each edge of $E_1$ corresponds
to exactly one edge in $E_0$ which does not belong to a contraction kernel. Let $\overline{F}_0$
denote the set of edges in $\overline{E}_0$ which are dual to the edges in $F_0$. Set $\overline{E}_1 := \overline{E}_0 \setminus \overline{F}_0$
and $\overline{G}_1 := (\overline{V}_0, \overline{E}_1)$. Note that $G_1$ and $\overline{G}_1$ form a dual pair of plane graphs.

Fig. 1. Iterative parallel transformations on graph pyramids (diagram labels: Abstraction, Hierarchy, Classification, Structure, Contents).

Fig. 2. Dual graph contraction: (a) $(G_0, \overline{G}_0)$, (b) $(G_1, \overline{G}_1)$, (c) $(G_2, \overline{G}_2)$.

The second step, called dual face contraction, is specified by a subset Si of 
Si. The edges of Si are required to form a spanning tree of Gi. In Fig. Eb the 
edges of Si are emphasized. Analogous to dual edge contraction, we generate 
G2 and set G2 := (Vi,S2) with S2 := Si \ Si. Each vertex in G2 has exactly 
one child in Gi, i.e. the vertex itself. The graphs G2 and G2 form another dual 
pair of plane graphs. In jKrofl.'oa,) the role of dual face contraction is confined to 
the removal of faces bounded by less than three edges. In the following we will 
drop this restriction in order to apply the theory of matroids in a general way. 
Subsequent parallel edge [face] contraction steps may be summarized by a single 
edge [face] contraction step. 

Each vertex in the graph pyramid represents a connected set of base level
vertices, the so-called receptive field. The receptive field of a base level vertex
contains exactly the vertex itself. For each vertex $v_k$ on a level $k \geq 1$ the
receptive field $RF(v_k)$ is defined by all vertices in the base level of the pyramid
which lead to $v_k$ by climbing the pyramid from children to parents. In Fig. 3
the odd levels are omitted.

$RF(v_0) = \{v_0\}$ for $v_0 \in V_0$,

$RF(v_k) = \bigcup \{RF(v_{k-1}) \mid v_{k-1} \text{ is a child of } v_k\}$, $k > 0$.

Note that the receptive fields on one level of the graph pyramid do not overlap,
since all vertices (except the apex) have exactly one parent.
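
The recursive definition of the receptive field can be illustrated by a short Python sketch. It is only a minimal illustration under stated assumptions: the pyramid is assumed to be given as a list of child-to-parent maps, one per level, and the names `receptive_fields` and `parents` are hypothetical and do not appear in the paper.

```python
def receptive_fields(base_vertices, parents):
    """Compute RF(v) for every vertex on every level.

    base_vertices: iterable of base level vertices (level 0).
    parents: list of dicts; parents[k][child] is the parent (on level k+1)
             of a vertex `child` on level k. This encoding is an assumption
             made for the illustration.
    Returns a list rf where rf[k][v] is the receptive field of v on level k.
    """
    # RF(v_0) = {v_0} for base level vertices.
    rf = [{v: {v} for v in base_vertices}]
    for parent_of in parents:
        level = {}
        for child, parent in parent_of.items():
            # RF(v_k) is the union of the receptive fields of its children.
            level.setdefault(parent, set()).update(rf[-1][child])
        rf.append(level)
    return rf


# Small example: four base vertices, two contraction steps.
example_parents = [
    {"a": "p", "b": "p", "c": "q", "d": "q"},
    {"p": "r", "q": "r"},
]
example_rf = receptive_fields("abcd", example_parents)
```

Because every vertex has exactly one parent, the receptive fields computed for one level are pairwise disjoint, as stated above.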

3 Representation of Graph Pyramids as Bases of 
Matroids 

Fig. 3. The vertices forming the receptive field of v are enlarged.

Let $G_0$ and $\bar{G}_0$ denote a pair of plane graphs and assume $\mathcal{P} = (G_0, G_1, \ldots, G_{2n})$
and $\bar{\mathcal{P}} = (\bar{G}_0, \bar{G}_1, \ldots, \bar{G}_{2n})$ to be graph pyramids constructed on top of the pair
$(G_0, \bar{G}_0)$ by dual graph contractions. We also assume that the apex $G_{2n}$ is a
graph with one vertex and zero edges. Let $G_i = (V_i, E_i)$ for all $0 \leq i \leq 2n$.
The edge set $E_0$ is required to be non-empty. The domain of all graph pyramids
with the above properties is denoted by $\mathcal{D}(G_0, 2n)$. For each edge $e \in E_0$ let $l(e)$
denote the maximal level of $\mathcal{P}$ which contains $e$, i.e.

$l(e) := \max\{l \mid e \in E_l \setminus E_{l+1}\}$. (1)

The construction of the graph pyramid is determined by the above assignment
of labels from $L := \{0, 1, \ldots, 2n-1\}$ to the edges in $E_0$ (similar to [Kro95b]).
The assignments are expressed by subsets of $E_0 \times L$. Let $B$ denote a subset of
$E_0 \times L$. We set

$E^0(B) := \{e \in E_0 \mid \exists j \text{ with } (e, j) \in B \text{ and } j \equiv 0 \bmod 2\}$. (2)

If $B = \{(e, l(e)) \mid e \in E_0\}$, where $l(\cdot)$ refers to the construction of a graph
pyramid, then

- for every $e \in E_0$ there exists exactly one $l$ with $(e, l) \in B$ and

- $E^0(B)$ forms a spanning tree in $G_0$.

Conversely, let $B \subseteq E_0 \times L$. If $B$ fulfills the above two items, then $B$ defines a
labeling of the edges from $E_0$ which describes the construction of a graph pyramid.
This follows from the fact that $E^0(B)$ forms a spanning tree in $G_0$ if and only
if

$E^1(B) := \{e \in E_0 \mid \exists j \text{ with } (e, j) \in B \text{ and } j \equiv 1 \bmod 2\}$ (3)

forms a maximal edge set in $G_0$ which is not a cutset [?] (a cutset of a
connected graph $G$ is a minimal set of edges of $G$ such that its removal from $G$
disconnects $G$). Hence, the edges in $\bar{G}_0$ which are dual to $E^1(B)$ form a spanning
tree $\bar{E}^1(B)$ in $\bar{G}_0$. In conjunction with the labels from $B$, the spanning trees
$E^0(B)$ and $\bar{E}^1(B)$ define the contraction kernels for the dual edge contraction
and the dual face contraction respectively.

Let $\mathcal{B}$ denote the collection of all subsets of $E_0 \times L$ which describe
constructions of graph pyramids in $\mathcal{D}(G_0, 2n)$. Note that $\mathcal{B} \neq \emptyset$ because of $E_0 \neq \emptyset$. The
following theorem states an exchange property for sets in $\mathcal{B}$.

Theorem 1. Let $B, B' \in \mathcal{B}$. For each $b \in B \setminus B'$ there exists $b' \in B' \setminus B$ such
that $B \setminus \{b\} \cup \{b'\} \in \mathcal{B}$.

Proof: It suffices to show that $E^0(B \setminus \{b\} \cup \{b'\})$ forms a spanning tree of $G_0$
or that $E^1(B \setminus \{b\} \cup \{b'\})$ forms a maximal non-cutset of $G_0$.

• Case $b = (e, l)$ with $l \equiv 0 \bmod 2$: In the fundamental circuit [?] of $E^0(B') \cup \{e\}$
there exists $e' \notin E^0(B)$ (since $E^0(B)$ contains no cycles). Let $l' \in L$ denote the
unique number with $(e', l') \in B'$ and set $b' := (e', l')$. Since $e' \notin E^0(B)$, it follows
that $e' \neq e$. This implies $b' \neq b$ and (because of $e' \in E^0(B') \cup \{e\}$) $e' \in E^0(B')$,
i.e. $l' \equiv 0 \bmod 2$. Since $e$ and $e'$ belong to the same cycle of $E^0(B') \cup \{e\}$ and both
have even labels, it follows that $E^0(B \setminus \{b\} \cup \{b'\})$ forms a spanning tree of $G_0$.

• Case $b = (e, l)$ with $l \equiv 1 \bmod 2$: The set $E^1(B') \cup \{e\}$ forms a cutset of $G_0$. Since
$E^1(B)$ contains no cutset, there exists $e' \in E^1(B') \cup \{e\}$, $e' \notin E^1(B)$. Let $l' \in L$
denote the unique number with $(e', l') \in B'$ and set $b' := (e', l')$. Since $e' \notin
E^1(B)$, it follows that $e' \neq e$. This implies $b' \neq b$ and (because of $e' \in E^1(B') \cup
\{e\}$) $e' \in E^1(B')$, i.e. $l' \equiv 1 \bmod 2$. Since $e$ and $e'$ belong to the cutset $E^1(B) \cup \{e'\}$,
it follows that $E^1(B \setminus \{b\} \cup \{b'\})$ is a maximal non-cutset. □



Definition 1. For $B \in \mathcal{B}$, $b \in B$, $b' \notin B$ the mapping $modif(B, b, b') := B \setminus
\{b\} \cup \{b'\}$ is called a local modification of $B$, if $modif(B, b, b') \in \mathcal{B}$.

The sets in $\mathcal{B}$, the so-called bases, determine a matroid $\mathcal{M} := (E_0 \times L, \mathcal{I})$ on
$E_0 \times L$, where

$\mathcal{I} := \{I \subseteq B \mid B \in \mathcal{B}\}$ (4)

[Oxl92]. Thus we may write $\mathcal{M} = \mathcal{M}(\mathcal{B})$. In [Bru69] the exchange property of
Theorem 1 is extended:

Theorem 2. Let $\mathcal{B}$ denote the collection of bases of a matroid and let $B, B' \in \mathcal{B}$.
For each $b \in B \setminus B'$ there exists $b' \in B' \setminus B$ such that

• $B \setminus \{b\} \cup \{b'\} \in \mathcal{B}$ and

• $B' \setminus \{b'\} \cup \{b\} \in \mathcal{B}$.

Theorem 2 implies that any $B \in \mathcal{B}$ can be adapted to any other $B' \in \mathcal{B}$ by local
modifications only.

If the construction of $\mathcal{P}$ is determined by $B$, each local modification of $B$
induces an operation on $\mathcal{P}$. We define:

Definition 2. An operation on a graph pyramid $\mathcal{P}$ is called a local relinking
operation, if it is induced by a local modification on a matroid base that describes
the construction of $\mathcal{P}$.

Fig. 4. Metric and structural comparison of receptive fields: (a) two compact sets
T and RF with Hausdorff distance 1; (b) structural similarity of T and RF.



4 Valuated Matroids 

In order to utilize local relinking operations for the adaption of a graph pyramid,
the choice of the operations has to be determined by the goal of the adaption. We
represent graph pyramids as bases of matroids and use a definition in [?],
where $R$ denotes, for example, the set of reals or the set of integers.

Definition 3 (Valuation on a Matroid). A valuation on a matroid $\mathcal{M} =
\mathcal{M}(\mathcal{B})$ is a function $\omega: \mathcal{B} \to R$ which has the following exchange property. For
$B, B' \in \mathcal{B}$ and $b \in B \setminus B'$ there exists $b' \in B' \setminus B$ such that

- $B \setminus \{b\} \cup \{b'\} \in \mathcal{B}$,

- $B' \setminus \{b'\} \cup \{b\} \in \mathcal{B}$,

- $\omega(B) + \omega(B') \leq \omega(B \setminus \{b\} \cup \{b'\}) + \omega(B' \setminus \{b'\} \cup \{b\})$.

A matroid equipped with a valuation is called a valuated matroid.

The following theorem [DW9?] implies that valuations on matroids can be
maximized by local modifications.

Theorem 3. Let $B \in \mathcal{B}$ and let $\omega$ be a valuation on the matroid $\mathcal{M} = \mathcal{M}(\mathcal{B})$.
Then $\omega(B)$ is maximal, if $\omega(B_m) \leq \omega(B)$ for all local modifications $B_m$ of $B$.

In order to utilize Theorem 3 for the adaption of graph pyramids by local
relinking operations, we have to find a valuation on the corresponding matroid
which is maximal if and only if the goal of the adaption is reached. Then we
apply a local relinking operation whenever it increases the valuation.

Fig. 5. Relinking towards a given receptive field. Modified edges are highlighted.
(a) $\mathcal{P}$, (b) adaption guided by $\omega_1$, (c) adaption guided by $\omega_2$ and $\omega_3$, (d) $\mathcal{P}'$.
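
The maximization strategy suggested by Theorem 3 (apply a local modification whenever it increases the valuation) can be sketched in Python. The sketch does not use the pyramid matroid of Section 3. As a stand-in it assumes the simpler graphic matroid of a small graph, whose bases are the spanning trees, together with a sum of edge weights, which is a standard example of a valuation. All names (`maximize_by_local_modifications`, `WEIGHTS`, and so on) are hypothetical.

```python
def is_spanning_tree(edges, n):
    """Return True if `edges` is a spanning tree on the vertices 0..n-1."""
    if len(edges) != n - 1:
        return False
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # the edge would close a cycle
            return False
        parent[ru] = rv
    return True


def maximize_by_local_modifications(base, ground, omega, is_base):
    """Apply local modifications B \\ {b} | {b'} while they increase omega."""
    base = frozenset(base)
    improved = True
    while improved:
        improved = False
        for b in base:
            for b_new in ground - base:
                candidate = (base - {b}) | {b_new}
                if is_base(candidate) and omega(candidate) > omega(base):
                    base, improved = candidate, True
                    break
            if improved:
                break
    return base


# Assumed toy instance: complete graph on 4 vertices with edge weights.
N = 4
WEIGHTS = {(0, 1): 4, (0, 2): 1, (0, 3): 3, (1, 2): 2, (1, 3): 5, (2, 3): 6}
GROUND = frozenset(WEIGHTS)


def omega(base):
    return sum(WEIGHTS[e] for e in base)


def is_base(base):
    return is_spanning_tree(base, N)


start = frozenset({(0, 1), (0, 2), (0, 3)})
best = maximize_by_local_modifications(start, GROUND, omega, is_base)
```

The loop terminates because each accepted modification strictly increases the valuation and the number of bases is finite. For a valuation in the sense of Definition 3, the base reached when no improving local modification remains is a global maximum.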



5 Adaption of Graph Pyramids 

In this section we use valuated matroids to adapt a graph pyramid $\mathcal{P}$ towards
having a receptive field equal to a given connected set $T$ of vertices from the base
level of $\mathcal{P}$. If there is no receptive field equal to $T$, we may still ask: How well
does $T$ fit into the pyramid? This question has a narrow metric and a wider
structural aspect: If there exists a receptive field $RF$ in $\mathcal{P}$ with a small distance
(Hausdorff distance, for example) to $T$, we say that $T$ fits well into $\mathcal{P}$. The wider
structural aspect is the following: Can a good fit of $T$ into $\mathcal{P}$ be achieved by only
a few (including zero) local relinking operations on $\mathcal{P}$? This case is illustrated
in Fig. 4b, where splitting off the receptive field $K$ from $RF$ yields $T$.
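
The metric aspect can be made concrete with a small Python sketch of the Hausdorff distance between two finite vertex sets. It assumes that the base level vertices carry planar coordinates and that the Euclidean metric is used. The paper does not fix the underlying metric, so both choices, as well as the name `hausdorff`, are assumptions made for illustration.

```python
import math


def hausdorff(a, b):
    """Hausdorff distance between two non-empty finite point sets."""

    def directed(p, q):
        # Largest distance from a point of p to its nearest point in q.
        return max(min(math.dist(x, y) for y in q) for x in p)

    return max(directed(a, b), directed(b, a))


# A receptive field RF that contains T plus one extra neighbouring vertex.
T = [(0, 0), (1, 0)]
RF = [(0, 0), (1, 0), (2, 0)]
d = hausdorff(T, RF)
```

In this example the two sets differ by one vertex at unit distance, so the Hausdorff distance is 1, which corresponds to the situation sketched in Fig. 4a.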

In the following, we will apply local relinking operations to the graph pyramid
$\mathcal{P}$, such that one of its receptive fields becomes equal to $T$. In Fig. 5a and 5d the
pyramid $\mathcal{P}$ and the adapted pyramid $\mathcal{P}'$ are illustrated by their receptive fields.
The set $T$ is given by the filled circles.

Since $T$ is contained in the receptive field of the apex of $\mathcal{P}$, there exists a
smallest receptive field of $\mathcal{P}$ which covers $T$ completely. In particular, there exists
a vertex $v^{\min}$ in $\mathcal{P}$ such that $T \subseteq RF(v^{\min})$ and $T \not\subseteq RF(v)$ for all children $v$
of $v^{\min}$. If $T = RF(v^{\min})$ no adaption of $\mathcal{P}$ is needed. Otherwise structural
modifications are needed only in the subpyramid of $\mathcal{P}$ whose apex is $v^{\min}$.

As explained in Section 3 we may describe the adaption of $\mathcal{P}$ by local
modifications on the corresponding matroid base $B$. The set $E_0$ of edges in the base
level of $\mathcal{P}$ is partitioned by the edge sets $E^0(B)$ and $E^1(B)$, as defined in (2)
and (3). The edge sets $E^0(B)$ and $E^1(B)$, in turn, are partitioned with respect
to $T$ into three classes each. For $i \in \{0, 1\}$ we set:

- $E^i_T(B) := \{e = (u, v) \in E^i(B) \mid \{u, v\} \subseteq T\}$,

- $E^i_{\bar{T}}(B) := \{e = (u, v) \in E^i(B) \mid \{u, v\} \cap T = \emptyset\}$,

- $E^i_{\partial}(B) := E^i(B) \setminus (E^i_T(B) \cup E^i_{\bar{T}}(B))$.

Adapting $\mathcal{P}$ towards containing $T$ as a receptive field, we focus on the following
edges in $E^0(B)$:

Definition 4. An edge $e = (w, z) \in E^0(B)$ conflicts with $T$, if

- $e \in E^0_{\partial}(B)$ and

- one end point of $e$ is contained in $RF(v^{\min}) \setminus T$.

Theorem 4. The graph pyramid $\mathcal{P}$ has no receptive field equal to $T$ $\Leftrightarrow$ $\mathcal{P}$ has
edges conflicting with $T$.

Proof of Theorem 4:

$\Rightarrow$: Assume that no receptive field of $\mathcal{P}$ equals $T$, i.e. $RF(v^{\min}) \supseteq T$ and
$RF(v^{\min}) \neq T$. The set of all edges from $E^0(B)$ with both end vertices in
$RF(v^{\min})$ forms a spanning tree of $RF(v^{\min})$ and thus contains an edge $e = (w, z)$
with $w \in T$ and $z \in RF(v^{\min}) \setminus T$. The edge $e$ conflicts with $T$.

$\Leftarrow$: Let $e = (w, z)$ be an edge conflicting with $T$. Without loss of generality we
assume $z \in RF(v^{\min}) \setminus T$. It follows that $RF(v^{\min}) \neq T$. If there was a receptive
field in $\mathcal{P}$ equal to $T$, $RF(v^{\min})$ would equal $T$, a contradiction. □



5.1 Algorithm for the Adaption 

The adaption of $\mathcal{P}$ towards containing $T$ is done in three steps, all of which
reduce the number of edges which conflict with $T$:

1. The number of edges in E^{B) is increased without affecting edges in E°(B). 

2. The number of edges in E^iB) is increased without affecting edges in E°{B). 

3. The labels of the remaining edges conflicting with T are raised. 

In order to perform the first two steps, we define valuations $\omega_1$ and $\omega_2$. The
matroid base $B$ is a subset of $E_0 \times L$. An element $x$ of $B$ can be written as
$x = (e_x, l(e_x))$. For $e_x \notin E^0(B)$ let $C(B, e_x)$ denote the fundamental circuit of
$E^0(B) \cup \{e_x\}$ and set

$l_B(e_x) := \max\{l(e) \mid e \in C(B, e_x), e \neq e_x\}$. (5)

Let $e_y \in C(B, e_x)$ with $e_y \neq e_x$, $l(e_y) = l_B(e_x)$. In [Kro95b] it is shown that
the graph pyramid defined by $B \setminus \{(e_y, l(e_y))\} \cup \{(e_x, l_B(e_x))\}$ equals the graph
pyramid defined by $B$. For $i \in \{1, 2\}$ we set $\omega_i(B) := \sum_{x \in B} val_i(x)$ with

( 1 : Cx G Ej{B),l{ex) = Isiex) 

val(x)-=\ ^ ^ exGE°iB)UE^,{B) 

I -I : exGEf{B)UEl{B),l{ex)^lB{ex) ^ ’ 

I 0 : otherwise 
Consider the case $i = 1$ first. The value 1 is given for labeled edges which we
want to insert between vertices of $T$. The same value is given for labeled edges
that we do not want to change anymore. The valuation $\omega_1(B)$ is maximal only
if the edges in $E^0_T(B)$ form a spanning tree of $T$. In the case $i = 2$ the roles
of $T$ and the complement of $T$ are reversed. Finally, the levels of the remaining
edges conflicting with T are raised to the highest even label lace an edge between 
vertices of RF{v?f“) can have. These local relinking operations are guided by 
the valuation lo^{B) val^{x) with 

( 1 : ex€ El{B)UEl{B) 

t)a;3(a;) := < -1 : € Ef{B),l{ea;) Lee (7) 

[ 0 : otherwise. 

Note that each local modification (guided by $\omega_1$, $\omega_2$ or $\omega_3$) reduces the number
of edges conflicting with $T$ by exactly one and raises the valuation by exactly
one. The effect on the receptive fields can each time be described as detaching a
part of $RF(v^{\min})$. These parts are fully determined by the edges conflicting with
$T$.

5.2 Example 

Fig. 5b shows that there are exactly two local modifications which raise the
valuation $\omega_1$. The total increase of $\omega_1$ thus amounts to 2. Fig. 5c shows that
$\omega_2$ and $\omega_3$ can be raised by 2 and 1 respectively. The comparison of Fig. 5a
and Fig. 5d yields that none of the receptive fields completely contained in $T$ or
completely contained in the complement of $T$ have been modified.

6 Conclusion 

Valuations on matroids were shown to be capable of guiding the relinking of
graph pyramids by local relinking operations. Furthermore, the local relinking
operations may be performed in an iterated parallel way. We suggest the new
method for tracking and motion analysis. In conjunction with dual graph
contraction and contents adjustment it is also suggested for graph based object
recognition.

Abstract. In this paper, a Hierarchical Entropy based Representation
for Texture Indexing (HERTI) is presented. The hypothesis is that any
texture can be effectively represented by means of a 1-D signal obtained
by a characteristic curve covering a square region (uniform under a given
criterion and a given segmentation). Starting from such a signal, HER
can then be applied effectively, thanks to its generality, to
image retrieval by content. Moreover, a Spatial Access Method (SAM),
namely the k-d-Tree, has been used in order to improve the search
performance. The results obtained on some databases show that HERTI achieves
very good performance with few false alarms and dismissals.

Keywords: Content Based Retrieval, Entropy, Textures, k-d-Tree.



1 Introduction 

In recent years, in a growing number of applications, images have become the
main kind of data to be acquired and processed. In medicine, for instance, the
possibility of producing databases containing images of clinical cases is
fundamental as a support for decision making [?]. In particular, a lot of attention has
been devoted to effective representations for approximate retrieval
by content [?]. Many techniques are devoted to describing shapes and, more
generally, objects contained in a pictorial scene as signals [?], while other
techniques are oriented towards analyzing textures, color and other features of interest
for a specific case [?].

In this paper we are interested in studying the second class and, in particular,
in dealing with images containing textures. It is well known, in fact, that
there are many fields (medicine, cultural heritage and so on) where an efficient
analysis of the textures is of great importance to characterize the knowledge
contained in the images. On the other hand, HER (Hierarchical Entropy based
Representation), which is a very useful representation for 1-D signals [?],
has been successfully used for feature extraction, achieving very promising
results. So, the aim of this paper is to combine these two aspects in order to
attain an effective retrieval on texture-based images. This leads us to define a
characteristic curve, as explained in detail in Section 3, which allows us to treat
an intrinsically 2-D problem (the texture characterization) by means of a 1-D
representation that can be processed by HER. In the following we will denote this
technique by HERTI (HER for Textures Indexing). The results obtained by
running HERTI on several databases seem to be very promising, with a significant
performance improvement over other techniques.

The rest of the paper is organized as follows. A short review of HER's
theoretical formulation is presented in Section 2, which
outlines the link between a given 1-D signal and its HER. The problem of
describing the micro structure of a given texture as a 1-D problem is the
topic of Section 3. In Section 4, the experimental results are
presented, testing the performance of HERTI in terms of the Normalized
Recall, which is well known in the literature. Finally, Section 5 gives the conclusions.



2 A Hierarchical Representation 

In this section we give some details about HER (Hierarchical Entropy based
Representation), which is a useful way to represent a given signal [?].
The underlying idea of this representation is to obtain a subset of the 1-D
signal samples (that is, its local maxima) and the associated energy.

Two interesting aspects of this representation stand out:

- the first one is its generality, so that it can be used wherever a 1-D problem
arises;

- the second one is its ability to describe a signal effectively using few
coefficients. In other words, it is strongly hierarchical, since it
reconstitutes the energy distribution of the signal under study with respect to
the maxima.

Such a representation has been used to describe textures in a very
particular way, which will be described in the next Section after a short review
of some fundamental theoretical concepts of HER.



2.1 A Review about HER 

Starting from a one-dimensional, time-discrete and finite signal $x(n)$ (i.e.
$x(n) \neq 0$ for $n \in [0, N-1]$), the absolute maximum can be defined as follows:

(1)

where the operator $\Delta_i$ is:

$\Delta_i = x(i) - x(i-1)$. (2)



Now, we can consider the Gaussian function having $x_i$ as maximum and
standard deviation $\sigma(x(i))$:

$\sigma(x(i)) = \ldots$, (3)

where

$E_r(x(i)) = \ldots$ (4)

is the relative energy, i.e. the energy weighted by the total energy of the signal:

$E = \sum_{i=0}^{N-1} |x(i)|^2$. (5)

The introduction of the foregoing Gaussian function allows us to define the
entropy associated to $x_i$:

$S(x(i)) = \sum_{j=i-\sigma(x(i))}^{i+\sigma(x(i))} \ldots$ (6)

which can be interpreted as a sort of energy of the signal $x$ in the range
$R = [i - \sigma(x(i)), i + \sigma(x(i))]$ (see Fig. 1).

When the signal has $k$ maxima, we can define the entropy of the whole signal as:

$S = \sum_{i=1}^{k} S(x(i))$. (7)

Starting from the array containing the $k$ maxima of $x$:

$X = \{x_1, \ldots, x_k\} : x_1 > x_2 > \ldots > x_k$ (8)

the signal $y$ is built as follows:

$y = \bigcup_{i_j = i_1}^{i_k} T(\tilde{R}_{i_j})\, G_{i_j}(\tilde{R}_{i_j})$, (9)

$G_{i_j}(R_{i_j}) = S_{i_j}$, (10)

$\tilde{R}_{i_j} = R_{i_j} - \bigl(R_{i_j} \cap \bigcup_{t<j} R_{i_t}\bigr)$, (11)

$T(\cdot) = \ldots$ (12)

$y$ is uniquely determined by the first $2k$ elements of the array $\hat{y}$ of length $N$,
whose remaining elements are 0, (13)

and this latter represents the HER (Hierarchical Entropy Representation) of
the signal $x$.

In order to compare two given signals $x_1, x_2$, we can use the Euclidean
distance between the corresponding signals $y_1, y_2$ or, equivalently, between the
arrays $\hat{y}_1$ and $\hat{y}_2$, computed as follows:

$D(y_1, y_2) = \sqrt{\sum_{l=0}^{N_l} \bigl(\hat{y}_1(l) - \hat{y}_2(l)\bigr)^2}$ (14)

where $N_l$ is the maximum between the numbers of the representative
coefficients of the two arrays $\hat{y}_1, \hat{y}_2$. In practice, in all problems to which we
applied HER, $N_l$ was very low.



Fig. 1. For a given signal, the location of the (absolute) maximum and the relative
area constitute the first two coefficients for HER.

Two immediate properties are:

- The distance $D$ between a given signal and itself is obviously equal to zero.

- Any phase change for a signal does not change its HER (invariance to a
signal phase change).



Finally, it should be pointed out that HER does not tend to the input signal, in the
sense that we select only the maxima of the signal, so that even if $\sigma = 0$ we
would obtain the input signal sampled at its maxima. On the contrary, selecting
all the points of the signal and defining the instantaneous energy density as
follows:

$\rho_E(n) = \frac{\sum_{n'=n-\Delta}^{n+\Delta} s(n')}{2\Delta}$ (15)

we have that:

$\rho_E(n) \to s(n)$. (16)

Nonetheless, the choice of the maxima gives a good representation of the energy
distribution.

HER has been utilized for many applications where the underlying problems 
required a 1-D representation. Now the problem is how to utilize this represen- 
tation to describe a texture. In other words, how to obtain a 1-D representation 
from an intrinsic 2-D problem like the texture analysis. This will be the topic of 
the next Section. 



3 Our Proposal 

Starting from the foregoing representation, we can see now how to apply it to 
the textures. 




Fig. 2. The spiral covering a squared region. 



Suppose that we have an image $\Omega$ with its segmentation obtained at a given
scale level. We then have a given number of subregions $R_i$ such that:

$\bigcup_i R_i = \Omega$ and $R_i \cap R_j = \emptyset$ for $i \neq j$. (17)

Starting from this segmentation, it is possible to study the
micro structure of any $R_i$ by looking at its information. So, for a given region (of
interest) $R_i$ we can always determine a subdomain having any shape. In this
paper we selected a square region $R$:

$R : l \cdot l = |R|$ (18)

where $l$ is the side length of $R$, such that

$R \subseteq R_i$. (19)

Starting from $R$, we now look for a curve $\gamma$ representative of it, so that HER
can be applied. Our proposal consists of a spiral covering the whole region, as
shown in Fig. 2. In other words, in this way we have obtained a 1-D signal
to which the results of the previous Section can be applied.
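
The conversion of the square region into a 1-D signal can be sketched in Python as follows. The paper only states that a spiral covers the whole region (Fig. 2). The starting corner, the clockwise direction and the inward orientation used below are assumptions, and `spiral_signal` is a hypothetical name.

```python
def spiral_signal(region):
    """Read an l x l region (list of rows) along a clockwise inward spiral.

    Assumption: the spiral starts at the top-left pixel and winds inwards;
    the paper does not specify these details.
    """
    rows = [list(r) for r in region]
    signal = []
    top, bottom = 0, len(rows) - 1
    left, right = 0, len(rows[0]) - 1
    while top <= bottom and left <= right:
        # Top row, left to right.
        signal.extend(rows[top][c] for c in range(left, right + 1))
        # Right column, top to bottom (excluding the corner already read).
        signal.extend(rows[r][right] for r in range(top + 1, bottom + 1))
        if top < bottom:
            # Bottom row, right to left.
            signal.extend(rows[bottom][c] for c in range(right - 1, left - 1, -1))
        if left < right:
            # Left column, bottom to top.
            signal.extend(rows[r][left] for r in range(bottom - 1, top, -1))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return signal


patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
x = spiral_signal(patch)   # [1, 2, 3, 6, 9, 8, 7, 4, 5]
```

Every pixel of the region is visited exactly once, so the resulting signal has $l \cdot l$ samples and can be passed to any 1-D representation.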



4 Experimental Results 



The performance of HERTI has been tested by implementing it on a 233 MHz PC using
MATLAB under the Windows 98 operating system. In this case, even if the 1-D
signal obtained using the technique explained in the foregoing Section is quite
complicated, we have fixed the number of the maxima for HER at 4. As in many
other applications, a fixed number of maxima allows us to use the k-d-Tree as
spatial access structure. In fact, as is well known in the literature, it performs better
than the sequential search, leading to a very low computational time [?].



In order to give objective measures of the obtained results, we use the
following evaluation criteria, which are commonly used for testing a retrieval system:

The Recall: the system's ability to retrieve all relevant time-series;

The Precision: the system's ability to retrieve only relevant time-series.

In the literature we can also find the Normalized Recall (NR), see [?], which can
be defined as follows. Consider a set of time-series in which the number of
the relevant ones is REL. If the Ideal Rank and the Average Rank are defined as
follows:

$IR = \frac{\sum_{r=1}^{REL} r}{REL}$, (20)

$AR = \frac{\sum_{r=1}^{REL} RANK_r}{REL}$, (21)

the difference AR - IR gives a measure of the effectiveness of the system, and it
can be normalized, in order to range between 0 and 1, in this way:

$NR = 1 - \frac{AR - IR}{TOT - REL}$. (22)
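
Equations (20) to (22) can be evaluated with a few lines of Python. The sketch below assumes that $RANK_r$ is the position at which the $r$-th relevant item is returned by the system and that TOT is the total number of items in the database. The text does not define TOT explicitly, so this reading is an assumption, as is the function name `normalized_recall`.

```python
def normalized_recall(ranks, total):
    """Normalized Recall following Eqs. (20)-(22).

    ranks: retrieval positions (1-based) of the REL relevant items.
    total: total number of items in the database (assumed meaning of TOT).
    """
    rel = len(ranks)
    ideal_rank = sum(range(1, rel + 1)) / rel      # Eq. (20)
    average_rank = sum(ranks) / rel                # Eq. (21)
    return 1 - (average_rank - ideal_rank) / (total - rel)  # Eq. (22)


perfect = normalized_recall([1, 2, 3], total=10)   # relevant items ranked first
worst = normalized_recall([8, 9, 10], total=10)    # relevant items ranked last
```

With this definition a system that returns all relevant items first obtains NR = 1, and one that returns them last obtains NR = 0.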



The images composing our database are 50 x 50 x 8 bits and show
single liver cells (healthy and cirrhotic), extracted from images acquired by an
electron microscope at 40x. Examples of the latter, i.e. normal
and cirrhotic liver respectively, are shown in Figs. 3 and 4. The performance of
our technique was evaluated as follows. After selecting from our database,
composed of 1960 cell images, 15 heterogeneous ones as queries, for each of them
we manually selected the 20 most similar ones and then computed the NR.

Fig. 3. 512 x 512 x 8 bits: An example of normal liver.

Fig. 4. 512 x 512 x 8 bits: An example of cirrhotic liver.

Fig. 5. 600 x 600 x 8 bits thyroid image acquired by an electron microscope at 250x:
healthy along with three different pathologies.

Table 1. Normalized Recall of HERTI on our liver database.

Measure              Value
Size of idb          1960
Number of queries    15
Normalized Recall    .987

Table 2. Normalized Recall of the Euclidean Distance on the same database.

Measure              Value
Size of idb          1960
Number of queries    15
Normalized Recall    .972

Table 3. Normalized Recall of HERTI on our thyroid database.

Measure              Value
Size of idb          2000
Number of queries    20
Normalized Recall    .981

Table 4. Normalized Recall of the Euclidean Distance on the same database.

Measure              Value
Size of idb          2000
Number of queries    20
Normalized Recall    .975

The results showed that HERTI achieves very good results and is very
effective in retrieving the same clinical case. Moreover, it is interesting to note
that the results, contained in Table 1, are very promising (very close to 1) and
appear to be better than the ones obtained with databases containing 1-D signals
in their original form, without any transform (see for instance [?]). This suggests that
our solution of treating a two-dimensional signal as a one-dimensional one has
also been effective.

Running the Euclidean Distance (ED) on the same queries, which is a
useful comparison and an indicative test for many indexing techniques,
we can see that, as shown in Table 2, HERTI performs better than ED, also
considering that the former uses only 8 coefficients, namely the maxima and their
associated energy.

Another example is shown in Fig. 5, where HERTI has been run on a
database containing thyroid images, healthy along with three different
pathologies. In this case HERTI performs better than ED too. Tables 3 and 4 report
the results in terms of Normalized Recall of HERTI and ED, respectively.

5 Conclusions 

In this paper, HERTI, a novel technique for content based retrieval on image
databases, has been presented. In particular, HERTI is based on HER, already used
on time-series databases with good results, and treats an intrinsically 2-D problem
like texture analysis as a 1-D problem by means of the concept of a characteristic
curve covering the whole texel. The combination of this concept with the generality
of HER allowed us to obtain very promising results, revealing a good ability in
retrieving images by content.

Abstract. We present an algorithm for extracting the surface skeleton
of a 3D object from its $D^6$ distance transform. The skeletal voxels are
directly detected and marked on the distance transform within a small
number of inspections, independent of object thickness. This makes the
algorithm preferable to algorithms based on iterative
application of topology preserving removal operations when working with
thick objects. The set of skeletal voxels is centred within the object,
symmetric, and topologically correct. It is at most 2-voxel wide (except for
some cases of surface intersections) and includes all centres of maximal
$D^6$ balls, which makes skeletonization reversible. Reduction to a unit
wide surface skeleton can be obtained by suitable post-processing.



1 Introduction 

The skeleton of a digital object is a convenient tool for shape analysis. A
commonly followed approach to skeletonization is based on the use of topology
preserving removal operations. To have a skeleton centred within the object and
hence reflecting its geometrical features, removal operations have to be
repeatedly applied border after border. This generally implies a long computation time,
proportional to object thickness, which prevents the actual use of the skeleton for
real time applications. An alternative approach, proposed only for 2D objects, is
based on the use of the distance transform. In [?], the city-block distance
transform of a 2D object is considered. On the distance transform, the layers, i.e., the
sets of pixels having the same distance label, can be interpreted as the successive
borders that would characterize the object when it undergoes iterated pixel
removal. On these layers the skeletal pixels can be identified and marked in a
small and fixed number of inspections. In this paper, we present an algorithm to
compute the surface skeleton of a 3D object which follows the marking
procedure introduced in [?] and is based on the skeletonization operations introduced
in [?]. The obtained surface skeleton is at most 2-voxel wide (except for some
cases of surface intersections) and includes all the centres of maximal balls. This
makes our procedure reversible. Reduction to a unit wide surface skeleton can be
obtained, e.g., by using the process described in [?]. In this paper, however, we
favour full reversibility and do not perform final thinning. Our surface skeleton
is centred within the object and is symmetric. It is obtained in a small number of
inspections, which makes its use convenient for applications when thick objects
have to be skeletonized.

2 Definitions and Notions 

The images considered here are volume images consisting of object and
background, i.e., 3D bi-level images. The central voxel in a 3 x 3 x 3 neighbourhood
has three types of neighbours: face, edge, and point neighbours. In Fig. 1 the
3 x 3 x 3 neighbourhood of a voxel (white) is shown. The six face neighbours,
the twelve edge neighbours, and the eight point neighbours are shown in light
grey, grey, and dark grey, respectively. 26-connectedness is used for the object
and 6-connectedness is used for the background.
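
The three neighbour types can be enumerated with a short Python sketch: a neighbour offset is a face, edge, or point neighbour depending on whether one, two, or three of its coordinates differ from those of the central voxel. The function name `classify_neighbours` is hypothetical.

```python
from itertools import product


def classify_neighbours():
    """Group the 26 neighbour offsets of a voxel into face, edge and point."""
    kinds = {1: "face", 2: "edge", 3: "point"}
    groups = {"face": [], "edge": [], "point": []}
    for offset in product((-1, 0, 1), repeat=3):
        nonzero = sum(c != 0 for c in offset)
        if nonzero:                      # skip the central voxel itself
            groups[kinds[nonzero]].append(offset)
    return groups


neighbours = classify_neighbours()
```

The face neighbours alone define 6-connectedness, while all three groups together define 26-connectedness.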



Fig. 1. Numbering of the voxels in a 3 x 3 x 3 neighbourhood of a voxel v, denoted
by 13 and shown in white. Face neighbours are shown in light grey, edge neighbours in
grey and point neighbours in dark grey.



An object has a tunnel if there exists a closed 26-connected path in the object
that cannot be deformed to a single voxel (for details, see [?]), and a cavity if a
background component is fully enclosed in the object.

The surface skeleton is a set of voxels with the same number of components, 
cavities and tunnels as the object; it has unit thickness and is centred within 
the object. Skeletonization can be performed by iteratively applying topology 
preserving removal operations to the object, border after border. A number 
of iterations proportional to the maximal object thickness is necessary. Such 
a skeleton is generally not symmetric, because only voxels indispensable for 
object connectedness are kept, and identification of the skeletal voxels depends 
on the order in which border voxels are examined. This is a drawback of iterative 
skeletonization, since symmetry is an important feature for shape analysis. 

A voxel V can safely be removed if its removal does not change the number 
of object components, cavities, or tunnels in its 3 x 3 x 3 neighbourhood, |21 
E]. To verify whether the number of object components changes, the number of 
26-connected components (A^^®) in the 26-neighbourhood of v can be counted. 
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Fig. 2. A D® ball with radius 15. 



Object voxels with $N^{26} \neq 1$ should be ascribed to the skeleton. If $N^{26} > 1$, 
removal of v would break the connectivity, i.e., a component would break into 
two or more components. If $N^{26} < 1$, removal of v would cause the vanishing of 
an object component consisting of one voxel. To preserve the number of cavities 
and tunnels, the number of 6-connected background components in the 
18-neighbourhood that are 6-adjacent to v ($N_f^{18}$) is counted. Voxels with 
$N_f^{18} \neq 1$ should be ascribed to the skeleton. In fact, if $N_f^{18} < 1$ or 
$N_f^{18} > 1$, removal of v would create a cavity or a tunnel, respectively. 
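
As a rough illustration of the two counts, the following Python sketch computes $N^{26}$ and $N_f^{18}$ for a single 3 x 3 x 3 neighbourhood. The representation (a set of neighbour offsets that are object voxels) and all function names are assumptions made for this example; it is a plain component search, not the computationally convenient method cited in the text.

```python
from itertools import product

# Offsets of the 26 neighbours of the central voxel.
OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def components(cells, adjacent):
    """Split `cells` into connected components under the `adjacent` predicate."""
    cells, comps = set(cells), []
    while cells:
        seed = cells.pop()
        comp, stack = {seed}, [seed]
        while stack:
            c = stack.pop()
            linked = {n for n in cells if adjacent(c, n)}
            cells -= linked
            comp |= linked
            stack.extend(linked)
        comps.append(comp)
    return comps


def adj26(a, b):
    # Distinct cells are 26-adjacent if no coordinate differs by more than 1.
    return max(abs(x - y) for x, y in zip(a, b)) == 1


def adj6(a, b):
    # Cells are 6-adjacent if they differ by one step along a single axis.
    return sum(abs(x - y) for x, y in zip(a, b)) == 1


def n26(obj):
    """Number of 26-connected object components among the 26 neighbours."""
    return len(components([d for d in OFFSETS if d in obj], adj26))


def nf18(obj):
    """Number of 6-connected background components in the 18-neighbourhood
    that contain at least one face neighbour of the centre."""
    background = [d for d in OFFSETS
                  if sum(map(abs, d)) <= 2 and d not in obj]
    return sum(1 for comp in components(background, adj6)
               if any(sum(map(abs, d)) == 1 for d in comp))


def removable(obj):
    """A voxel is removable only if both counts equal 1."""
    return n26(obj) == 1 and nf18(obj) == 1
```

For instance, if the eight neighbours in the plane of the central voxel are object voxels and all others are background, `n26` is 1 but `nf18` is 2, so removing the centre would create a tunnel.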

In the distance transform (DT), each object voxel is labelled by the distance 
to the closest background voxel, 0. In this paper we will use the $D^6$ distance, i.e., 
the number of steps in a minimal 6-connected path between voxels, to compute 
the distance transform ($DT^6$). This is the 3D equivalent of the city-block distance 
transform. The distance value of a voxel of $DT^6$ can be interpreted as the radius 
of a ball centred on the voxel and included in the object. (Note that the shape 
of a $D^6$ ball is an octahedron, see Fig. 2.) An object voxel in $DT^6$ is a centre 
of maximal ball (CMB) if the $D^6$ ball centred on the voxel is not completely 
covered by any other single ball in the object. On $DT^6$, a voxel is a CMB if none 
of its face neighbours has a higher distance value. From the set of CMBs the 
whole object can be recovered by applying the reverse distance transformation. 
Our skeletonization algorithm does not require the explicit computation of 
$N^{26}$, since the object connectedness is preserved by other conditions. A voxel 
satisfying any of the following conditions, introduced in 0, is marked as a skeletal 
voxel: 

Condition A1: No pair of opposite face neighbours of a voxel v exists such that 
one is a background voxel and the other is an internal voxel. (See Fig. 3, left.) 

Condition A2: There exists an edge neighbour, e, of v which is a border 
voxel, such that the two voxels that are face neighbours of both e and v are 
background voxels. (See Fig. 3, middle.) 

Condition A3: There exists a point neighbour, p, of v which is a border voxel, 
such that the six voxels that are neighbours of both p and v are background 
voxels. (See Fig. 3, right.) 

To avoid changing the number of cavities and tunnels, the condition we use 
is based on the computation of $N_f^{18}$, performed by means of the computationally 
convenient method introduced in 0. 

Fig. 3. Voxels involved in Condition A1, left, Condition A2, middle, and Condition 
A3, right. White, grey, and black denote background voxels, border voxels, and internal 
voxels, respectively. 



In the following, we use a digital Euclidean ball with radius 28, shown in 
Fig. 4, left, as a running example. This is a difficult object because the skeleton 
that we compute is based on $DT^6$. The set of CMBs of the object is shown to 
the right of the ball. The shape of the resulting skeleton should be determined 
by the CMBs. 
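
The $D^6$ distance transform and the CMB test described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the object is a boolean NumPy array that contains at least one background voxel, the function names `dt6` and `cmbs` are hypothetical, and a breadth-first propagation is used for clarity rather than any particular scanning scheme.

```python
from collections import deque

import numpy as np

# Offsets of the six face neighbours.
FACES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def inside(q, shape):
    return all(0 <= c < n for c, n in zip(q, shape))


def dt6(obj):
    """Label each object voxel with the number of steps in a minimal
    6-connected path to the closest background voxel."""
    obj = np.asarray(obj, dtype=bool)
    dist = np.where(obj, -1, 0)          # -1 marks "not yet labelled"
    queue = deque(zip(*np.nonzero(~obj)))  # start from all background voxels
    while queue:
        p = queue.popleft()
        for d in FACES:
            q = tuple(a + b for a, b in zip(p, d))
            if inside(q, obj.shape) and dist[q] == -1:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist


def cmbs(dist):
    """Object voxels none of whose face neighbours has a higher label."""
    found = []
    for p in zip(*np.nonzero(dist > 0)):
        neighbours = (tuple(a + b for a, b in zip(p, d)) for d in FACES)
        if not any(inside(q, dist.shape) and dist[q] > dist[p]
                   for q in neighbours):
            found.append(tuple(int(c) for c in p))
    return found
```

For a 3 x 3 x 3 cube of object voxels surrounded by background, the centre receives label 2 and all other cube voxels label 1; the six voxels that are face neighbours of the centre are not CMBs, while the centre and the remaining cube voxels are.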

3 Marking the Skeletal Voxels 

We distinguish three types of skeletal voxels: intrinsic skeletal voxels and 
induced skeletal voxels (for short, intrinsic voxels and induced voxels), and 
tunnel voxels. Intrinsic voxels and induced voxels are necessary to guarantee 
object connectedness and to prevent changing the number of cavities. Tunnel 
voxels are necessary to prevent changing the number of tunnels. Skeletonization 
is done in three steps. Intrinsic voxels are found during the first step. Induced 
voxels are found during the second and the third steps. Tunnel voxels are found 
during the third step. 

Once $DT^6$ has been computed, intrinsic voxels can be identified in a parallel 
fashion, while induced voxels can be found only after their neighbourhood has 
been modified by the detection of other skeletal voxels. Different interpretations 
of neighbouring voxels are necessary for intrinsic and induced voxels. For intrinsic 
voxel detection, neighbours of a voxel v with label smaller than v are interpreted 
as belonging to the background, neighbours with the same label as v are seen as 
border voxels, and neighbours with label larger than v are interpreted as internal 
voxels. For induced voxel detection, the presence of marked voxels (i.e., skeletal 
voxels) in the neighbourhood has to be taken into account. To be specific, marked 
neighbours with label smaller than v are interpreted as border voxels. For all 
other neighbours, including the remaining marked voxels, the interpretation used 
for intrinsic voxel detection is followed. 
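
The interpretation used for intrinsic voxel detection can be illustrated with Condition A1. The sketch below rests on assumptions that are not part of the paper: `dt` is a NumPy array holding the distance labels, `p` is an index that does not lie on the array border, and the function name is hypothetical.

```python
import numpy as np


def condition_a1(dt, p):
    """Condition A1 under the intrinsic-voxel interpretation: neighbours
    with a smaller label count as background, neighbours with a larger
    label as internal, and equally labelled neighbours as border voxels."""
    v = dt[p]
    for axis in range(3):
        lo, hi = list(p), list(p)
        lo[axis] -= 1
        hi[axis] += 1
        a, b = dt[tuple(lo)], dt[tuple(hi)]
        # A1 fails if one of two opposite face neighbours is background
        # (label < v) while the other is internal (label > v).
        if a < v < b or b < v < a:
            return False
    return True


# Section of a slab that is three voxels thick along the last axis.
dt = np.tile(np.array([0, 1, 2, 1, 0]), (3, 3, 1))
```

In this slab, the voxel labelled 2 at index `(1, 1, 2)` satisfies Condition A1, whereas the voxels labelled 1 on either side of it do not, because each lies between a background voxel and an internal voxel.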

Intrinsic voxels include all CMBs as well as voxels that are not CMBs but are 
placed in saddle configurations (saddle voxels). Intrinsic voxels could all be 
detected by using Condition A1, but we specifically detect the CMBs to distinguish 
them from the saddle voxels. We identify and mark all intrinsic voxels in parallel 
during one scan of $DT^6$. (First step of the skeletal voxel detection process.) 

Induced voxels are identified in two steps depending on whether they are 
induced by the CMBs, or are induced by other skeletal voxels (saddle voxels or 
other already induced voxels). The voxels induced by the CMBs have label equal 
to or larger than the label of the inducing CMBs. They are found and marked 
during one scan of $DT^6$. (Second step of the detection process.) The remaining 
induced voxels have label larger than the label of the inducing voxels. They are 
found and marked during the third step, which consists of eight scans of $DT^6$, 
possibly repeated. In general, one set of eight scans is enough to mark all the 
induced voxels, even for complex and thick objects. Note that no more than two 
sets of eight scans are necessary, due to the constraints posed by the distance 
transformation. During the second and third steps Conditions A1, A2, and A3 
are used. 

Since voxels that would be classified as internal voxels by iterative 
skeletonization are directly detected by Condition A1, there is no need for an extra 
condition to prevent the creation of cavities. Moreover, since marking a voxel 
by means of Conditions A1, A2, and A3 does not change the number of object 
components in the 3 x 3 x 3 neighbourhood of that voxel, no merging of cavities 
can occur. Hence, only tunnel creation has to be prevented. Conditions based 
on the computation of $N_f^{18}$ are used to find tunnel voxels during the third step. 
To favour maintenance of symmetry, tunnel voxels are detected only starting 
from the fifth scan, since most of the induced voxels have already been found by 
then. In general, after one complete set of eight scans, all tunnel voxels are found. 
However, for complex objects, repetitions of the eight scans could be necessary. 

Since detection of the intrinsic voxels can be performed straightforwardly, 
no further details are provided here. A description of induced and tunnel voxel 
detection is given below. 

In the following sections, both the voxel and its associated distance value will 
be denoted by the same letter. 



3.1 Induced Voxel Detection 

During the second step, for each CMB, say s, the 3 x 3 x 3 neighbourhood of 
s is inspected. The maximum distance label, max, is computed within each 
26-connected component of neighbours of s with label equal to or larger than s. For 
each voxel v in the 3 x 3 x 3 neighbourhood of s such that v = max, Conditions 
A1, A2, and A3 are checked for v, which is possibly marked. Marking is done in 
a parallel way during this step, which consists of one scan. 

With reference to the following 3 x 3 x 3 neighbourhood (three consecutive 
3 x 3 planes, from left to right) 

0 0 0     0 1 1     1 1 2 
0 1 1     1 2 2     2 2 3 
1 1 2     2 2 3     3 3 4 

where the central CMB labelled 2 is the centre of the middle plane, the only 
voxel v for which Conditions A1, A2, and A3 can be checked is the 4 in the last 
plane, since 4 is the maximum label in the neighbourhood. The 4 is actually 
marked due to Condition A3. Checking only voxels v = max, instead of all voxels 
v > s, is done to limit the number of induced voxels. If all voxels v > s were 
checked, the 3's that are edge neighbours of the CMB would also be marked (in 
parallel) by Condition A2. These voxels are superfluous for the 
preservation of object connectedness, hence their detection would only produce 
unwanted thickening of the set of skeletal voxels. 

With reference to the small part of $DT^6$ below (three consecutive planes, 
from left to right) 

0 0 0 0     0 0 0 0     0 0 0 0 
0 0 0 0     0 1 0 1     0 0 0 0 
0 0 0 1     0 0 1 2     0 0 0 1 
0 0 1 2     0 1 2 3     0 0 1 2 

we note that the maximum distance label in the neighbourhood of the CMB 
labelled 1 (second row, second column of the middle plane) is 1. If the 1 that is 
diagonally adjacent to it in the same plane (third row, third column) were not 
allowed to be checked against Conditions A1, A2, and A3, skeleton connectedness 
could not be achieved. For this reason, when we compute the maximum distance 
label of the neighbours of s, we also include neighbours labelled as s. (Note that 
in the neighbourhood of any other voxel s that is not a CMB, the maximum 
distance label is always larger than the label of s.) 

During the second step some of the saddle voxels can be removed, if 
superfluous for connectedness preservation. This is the case when a saddle voxel is 
a neighbour of both a CMB and the voxel(s) induced by that CMB in step 2. This 
can be seen with reference to the small part of $DT^6$ below (three consecutive 
planes, from left to right) 

2 1 0 0     1 1 0 0     0 0 0 0 
1 1 0 0     1 2 1 0     0 1 0 0 
0 0 0 0     0 1 0 0     0 0 0 0 
0 0 0 0     0 0 0 0     0 0 0 0 

showing a CMB (label 2, in the middle plane) and two saddle voxels (labelled 1). 
The CMB induces, by Condition A3, marking of a voxel (label 2, in the left plane), 
which makes the two saddle voxels superfluous. Thus we remove their markers. 

For our running example, the result after the second step is shown in Fig. 4, 
right, and to the left of this, the voxels added by step 2. 


Fig. 4. From left to right, a digital Euclidean ball with radius 28, the set of CMBs, 
the voxels added by step 2, and the result of steps 1 and 2. 



The remaining induced voxels are detected during the third step, which is 
performed by repeatedly inspecting $DT^6$. Any voxel v, neighbour of an already 
marked voxel s, such that v > s and such that v is the maximum distance label 
in the 3 x 3 x 3 neighbourhood of s is checked against Conditions A1, A2, and 
A3. To keep the symmetry of the skeleton, Conditions A1, A2, and A3 are only 
checked in the direction of the already visited neighbours of v, including the 
inducing voxel s. When a voxel is inspected, one of its point neighbours, three 
of its edge neighbours and three of its face neighbours are already visited. This 
implies that, for each inspection, Condition A3 is considered in one direction out 
of eight possible ones, Condition A2 in three directions out of twelve possible 
ones, and Condition A1 in three directions out of six possible ones. We need 
eight inspections, so that Condition A3 is checked in all directions. These eight 
inspections can be arranged in such a way that after four inspections Conditions 
A2 and A1 are checked in all possible directions. We propose the following 
inspection order, where, for the neighbours, the numbering shown in Fig. 1 is 
used, see Table 1. The order implies that after four inspections, Condition A2 
is investigated once in every direction and Condition A1 twice. The inspections 
are repeated as long as markers are set. A different ordering of the inspections will 
produce the same result, but the number of inspections could be larger. 



Table 1. Proposed inspection order for the third step. 
3.2 Tunnel Voxel Detection 

To avoid creation of tunnels, voxels with $N_f^{18} > 1$ are marked. However, this is 
not always enough. With reference to the part of $DT^6$ shown below 



0 0 0 0 1 1 1 
0 0 0 1 2 2 2 
0 0 1 2 3 2 2 
0 1 2 3 2 1 1 
1 2 3 2 1 0 0 
1 2 2 1 0 0 0 
1 2 2 1 0 0 0 

0 0 0 0 0 0 0 
0 0 0 0 0 1 1 
0 0 0 1 1 2 2 
0 0 1 2 2 1 1 
0 0 1 2 1 0 0 
0 1 2 1 0 0 0 
0 1 2 1 0 0 0 

0 0 0 0 0 1 1 
0 0 0 1 1 2 2 
0 0 1 2 2 2 2 
0 1 2 3 2 1 1 
0 1 2 2 1 0 0 
1 2 2 1 0 0 0 
1 2 2 1 0 0 0 


where voxels marked during the first two steps are shown in bold, we see that 
at least one of the 2's in italic and one of the underlined 2's have to be 
marked to avoid creation of two tunnels. To have a symmetric skeleton, all four 
2's should indeed be marked. But for all these voxels, $N_f^{18} = 1$. Note that if 
an iterative skeletonization algorithm is used, the condition based on $N_f^{18}$ is 
enough to prevent tunnel creation, but an asymmetric skeleton is obtained. In 
fact, depending on the order in which border voxels are examined on the border 
“2”, $N_f^{18} = 2$ for only one 2 in each of the two pairs (italic and underlined). 

We need a special condition to fill tunnels consisting of pairs of equilabelled 
voxels. If voxels with the same distance value as the central voxel are interpreted 
as belonging to the background (instead of to the border!) and $N_f^{18}$ is computed 
in a parallel fashion with this interpretation, then $\bar{N}_f^{18} = 2$ for all four 
2's. To avoid ambiguities, we will denote by $\bar{N}_f^{18}$ the number of 6-connected 
background components 6-adjacent to the central voxel in the 18-neighbourhood, 
when the neighbours labelled as the central voxel are interpreted as belonging 
to the background. 

Indeed, tunnel voxels should be identified only after tunnels are created. Thus, 
tunnels should be filled only after all voxels detected by Conditions A1, A2, 
and A3 are marked. To fill the tunnels symmetrically, an iterative procedure 
should be used, active layer after layer in parallel. This requires a number of 
iterations dependent on the size of the tunnels to be filled. Since this can be 
rather large, e.g., when a tunnel is due to two diverging curves originating from 
the same CMB, an iterative method is not convenient. An alternative would be 
to fill the tunnels, still after all voxels detected by Conditions A1, A2, and 
A3 have been marked, by repeated sequential inspections of $DT^6$. This might 
reduce the computation time, but tunnel voxels could be marked asymmetrically 
due to the sequentiality of the process. A compromise is necessary between the 
desires of having a symmetric skeleton and of reducing the computation time. To 
speed up the process, we perform tunnel detection sequentially, and include this 
detection in the third step. (Note that tunnel voxels do not induce any voxel to 
be marked by Conditions A1, A2, and A3.) To limit the asymmetries, we start 
tunnel detection only from the fifth scan, since most of the skeletal voxels have 
already been marked by then. We are aware that the scans needed to detect the 
induced voxels might not be sufficient to fill all tunnels, if these have complex 
structures. Thus the third step, which includes tunnel filling, might need to be 
repeated more than twice to fill all tunnels. 

Besides starting tunnel filling from the fifth scan, we also reduce the possibility 
of creating asymmetries in the following way. During each scan, marking 
by $N_f^{18} > 1$ is done sequentially and marking by $\bar{N}_f^{18} > 1$ is done in parallel. 
$N_f^{18}$ is computed by ignoring markers set by $\bar{N}_f^{18}$ during the same scan, and vice 
versa. Some voxels for which $\bar{N}_f^{18} > 1$ are redundant if a face neighbour with 
the same distance value is marked by $N_f^{18} > 1$ during the same scan. Therefore, 
if the currently visited voxel v is marked by $N_f^{18} > 1$, any already visited face 
neighbour with the same label and marked by $\bar{N}_f^{18} > 1$ has the marker removed; 
moreover, any not yet visited face neighbour with the same label is prevented 
from being marked by $\bar{N}_f^{18} > 1$ during the same scan. 

For our running example, the resulting skeleton is shown in Fig. 5, right. The 
result of steps 1 and 2 and the voxels added by the third step are shown to the 
left and in the middle, respectively. Two other examples are shown in Fig. 6. All 
images used in this paper are 64 x 64 x 64 voxels. The number of scans of $DT^6$ 
necessary to find all skeletal voxels, including tunnel voxels, is 8 (1+1+6) for the 
running example, and 9 and 8 for the two examples in Fig. 6. Even if it might 
be difficult to see from the figures, the three surface skeletons are symmetric. Of 
course, they are also centred within the objects, include all CMBs, and are at 
most 2 voxels thick. 




Fig. 5. The result of steps 1 and 2, left, the voxels added by step 3, middle, and the 
resulting skeleton, right. 




Fig. 6. Two objects and their surface skeletons. 
4 Conclusion 

A skeletonization algorithm based on the inspection of the distance transform 
was presented. The resulting surface skeletons are symmetric, fully reversible, 
centred in the object (with respect to $D^6$), topologically correct, and at most 
2 voxels wide. To obtain unit-wide skeletons, post-processing could be done (not 
performed in this paper). Since the skeleton is based on the inspection of the 
distance transform, the number of scans does not depend on the thickness of the 
object (even if tunnel filling depends on object complexity and we cannot claim 
that our skeletonization ends in an a priori known number of scans). Thus, the 
computation time is smaller than the time required by iterative skeletonization 
when thick objects are considered. 

Further work is in progress to improve tunnel filling and to extend the algo- 
rithm to other distance transforms. 
Abstract. This paper considers how ambiguous graph matching can be 
realised using a hybrid genetic algorithm. The problem we address is how 
to maximise the solution yield of the genetic algorithm when the available 
attributes are ambiguous. We focus on the role of the selection operator. 
A multi-modal evolutionary optimisation framework is proposed, which 
is capable of simultaneously producing several good alternative 
solutions. Unlike other multi-modal genetic algorithms, the one reported here 
requires no extra parameters: solution yields are maximised by removing 
bias in the selection step, while optimisation performance is maintained 
by a local search step. 



1 Introduction 

In realistic settings graph-matching is invariably frustrated by structural error, 
and as a result it is not possible to locate an exact isomorphism. Early attempts 
at inexact matching used heuristics to reconcile dissimilar structures. These 
heuristics have been augmented by information theoretic and probabilistic crite- 
ria 0. Over the last ten years, there has been more interest in using statistical 
methods for inexact attributed graph-matching, instead of adopting a purely 
structural approach jS|. However, these methods invariably assume that there 
is a single best match. This approach works best where each of the attributes 
is distinctly located in the feature space, when it is possible to make a single 
minimum risk assignment. However, when the attributes are poor in the sense 
that there is considerable overlap in the feature space, restricting attention to 
the most probable assignments incurs a substantial risk of ignoring alternatives 
which are only slightly less good. For example, in a situation where there are two 
possible assignments with probabilities close to 0.5, it would be unwise to ignore 
the less likely one. This paper will demonstrate how to overcome this difficulty 
for graph matching using a multi-modal hybrid genetic algorithm. 

The idea that genetic algorithms can be used to simultaneously find more 
than one solution to a problem was first mooted by Goldberg and Richardson 
in 0. They attempted to prevent the formation of large clusters of identical 
individuals in the population by de-rating the fitness function. Other techniques 
include crowding jS|, sequential niching and distributed genetic algorithms 
[13]. A common feature of these approaches has been the necessity for extra 
parameters. Niching and crowding strategies typically require two or three extra 
parameters to be controlled. These parameters are needed, for example, to 
determine when to de-rate the fitness of an individual, by how much, and the distance 
scale of the de-rating function. In distributed algorithms, it is necessary to 
decide how to arrange the sub-populations, their sizes, and under what conditions 
migration between them may occur. In |^, Smith and co-workers demonstrated 
a situation in which niching could occur in a standard genetic algorithm, without 
the need for any extra parameters. 

The main conclusion that can be drawn from this literature is that the choice 
of selection operator plays a key role in determining the solution yield in multi- 
modal or ambiguous optimisation problems. However, most of the reported work 
is focussed on toy or contrived problems. Our aim here is to undertake a 
systematic study of selection mechanisms for the problem of inexact graph matching. 
In particular, we will show how suitable algorithm modifications can improve 
solution yield without introducing any new parameters. We work with a hybrid 
genetic algorithm which incorporates a hill-climbing step, since Cross, Wilson 
and Hancock 0 have found that graph matching was only feasible with such an 
algorithm. 

2 Bayesian Matching Criterion 

The problems considered in this paper involve matching attributed relational 
graphs. An attributed relational graph is a triple, $G = (V, E, A)$, where $V$ is 
the set of vertices or nodes, $E \subseteq V \times V$ is the set of edges, and $A \subseteq V \times \Re^k$ is 
the set of measurement $k$-vectors relating to the original scene. Graph matching 
is the problem of establishing a correspondence between a data graph, 
$G_D = (V_D, E_D, A_D)$, and a model graph, $G_M = (V_M, E_M, A_M)$. This correspondence, 
$f : V_D \to V_M \cup \{\phi\}$, is a labelling of the nodes in $V_D$ with nodes from $V_M$ or a 
special null label, $\phi$, for unmatchable nodes. 

In [10], Wilson and Hancock described a framework in which both 
neighbourhood structure and node attributes were combined in a single measure of 
matching consistency. The goal is to optimise the following a posteriori 
probability criterion 



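$$P(f \mid A_D, A_M) \propto \left[ \prod_{(u,v) \in f} P(u, v \mid x_u, x_v) \right] \sum_{u \in V_D} \sum_{S_v \in \Theta_v} \exp[-k_e D(\Gamma_u, S_v)] \qquad (1)$$ 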

where the posterior matching probability, $P(u, v \mid x_u, x_v)$, is the probability of 
node $u$ from the data graph matching node $v$ in the model graph given their 
measurements, $x_u$ and $x_v$. Structural constraints are captured by a dictionary 
of legal assignments, $\Theta_v$, over the neighbourhood of each node, $v$, in the data 
graph. The constant $k_e = \ln \frac{1 - P_e}{P_e}$ is defined in terms of the probability of 
matching error $P_e$. The distance function $D(\Gamma_u, S_v)$ measures the similarity of 
the current matching assignment to the data-graph neighbourhood $\Gamma_u$ and the 
consistent matching configuration $S_v$ drawn from the dictionary. In CH, Myers, 
Wilson and Hancock have shown that the Levenshtein distance is a good choice 
for the distance function $D$. 



2.1 Measurement Ambiguity 

The measurement information contributes to the matching criterion via the 
posterior matching probability, $P(u, v \mid x_u, x_v)$, which has yet to be defined. In [10], 
Wilson and Hancock defined it in terms of the Euclidean distance between 
attribute pairs for non-null mappings: 



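$$P(u, v \mid x_u, x_v) = \begin{cases} P_\phi & \text{if } v = \phi \\ (1 - P_\phi) \dfrac{\exp\left[-\frac{(x_u - x_v)^2}{2\sigma^2}\right]}{\sum_{w \in V_M} \exp\left[-\frac{(x_u - x_w)^2}{2\sigma^2}\right]} & \text{otherwise} \end{cases} \qquad (2)$$ 
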



where $P_\phi$ is the prior probability of a null match, $f(u) = \phi$, which may be 
computed using the size difference of the graphs to be $2 \left| \frac{|V_D| - |V_M|}{|V_D| + |V_M|} \right|$, and $\sigma^2$ 
is the estimated variance of $x_v$. This effectively regards the model graph node 
measurement, $x_v$, as a mean about which the data graph node measurement, 
$x_u$, varies with estimated variance $\sigma^2$ under the null hypothesis that the two 
measurements are the same (because the nodes match). This approach requires 
the assumption that a data measurement is only likely to be statistically close to 
one of the model measurements. This is ideal when there is little overlap between 
classes, e.g. for possible angles of line-fragments segmented from a radar image. 
However, if there is significant overlap, e.g. in the average intensities of regions, 
such a scheme will not reflect these ambiguities in its classification of features. 

The alternative is to compare the data measurements to the model measu- 
rements using an artificial scale. This can be done by considering the number of 
standard deviations separating the data measurement from its class mean under 
the null hypothesis that the nodes match. Table 1 gives an example of such a 
scale for the arbitrary classes “similar”, “comparable”, and “different”. 



Table 1. Example Scale for Measurement Comparisons. 

Class        Range of standard deviations from x_v 
Similar      [0, 1.0] 
Comparable   (1.0, 2.0] 
Different    (2.0, ∞) 



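Consider the standardised distance, $\Delta_{uv} = \|x_v - x_u\| / \sigma$. The probability 
that $\Delta_{uv}$ lies within the interval $[a, b]$ is twice the standard Normal integral from 
$a$ to $b$: 

$$P(a \le \Delta_{uv} \le b) = \frac{2}{\sqrt{2\pi}} \int_a^b e^{-z^2/2} \, dz = \mathrm{erf}\left(\frac{b}{\sqrt{2}}\right) - \mathrm{erf}\left(\frac{a}{\sqrt{2}}\right) \qquad (3)$$ 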
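
The interval probability can be evaluated directly with the error function. The following Python sketch is illustrative only: the function name and the dictionary holding the example scale are assumptions, and the class boundaries are those of the example scale in Table 1.

```python
import math


def interval_probability(a, b):
    """Probability that the standardised distance lies in [a, b]:
    twice the standard Normal integral from a to b."""
    return math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2))


# Example scale: class name -> range in standard deviations.
SCALE = {
    "similar": (0.0, 1.0),
    "comparable": (1.0, 2.0),
    "different": (2.0, math.inf),
}

for name, (lo, hi) in SCALE.items():
    print(f"{name:>10}: {interval_probability(lo, hi):.4f}")
```

Because the three intervals partition the non-negative half-line, the three probabilities sum to one.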
Each of the classes in Table 1 corresponds to a separate interval which must 
be considered. Rather than introduce so many extra parameters, it is better to 
simplify the classification to “similar” if $\Delta_{uv} \in [0, a]$ and “dissimilar” otherwise. 
Thus, $P(u, v \mid x_u, x_v)$ can be defined as follows 



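$$P(u, v \mid x_u, x_v) = \begin{cases} P_\phi & \text{if } v = \phi \\ (1 - P_\phi) \dfrac{\exp\left[-\frac{(x_u - x_v)^2}{2\sigma^2}\right]}{\sum_{w \in V_M} \exp\left[-\frac{(x_u - x_w)^2}{2\sigma^2}\right]} & \text{if } a = 0 \\ (1 - P_\phi) \, P[\Delta_{uv} \le a] & \text{if } \Delta_{uv} \le a \\ (1 - P_\phi) \, (1 - P[\Delta_{uv} \le a]) & \text{otherwise} \end{cases} \qquad (4)$$ 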



For convenience, the original unambiguous definition is used when a = 0. At 
the cost of an extra parameter, a, ambiguous measurements can now be handled. 
The important property of equation (4) is that when a > 0, it assigns the exact 
same probability to sets of mappings, thus enabling different alternatives to be 
considered. 



3 Genetic Algorithms 



Having defined an attribute model which captures the ambiguous nature of the 
raw image attributes, in this section we consider how to use genetic algorithms to 
recover multiple solutions to the graph-matching problem. In a standard genetic 
algorithm, selection is crucial to the algorithm’s search performance. Whereas 
mutation, crossover and local search are all “next-state” operators, selection 
imposes a stochastic acceptance criterion. The standard “roulette” selection 
algorithm, described by Goldberg, assigns each individual a probability of 
selection, $p_i$, proportional to its fitness, $P_i$. The genetic algorithm used here 
allows the population, $\Psi$, to grow transiently and then selects the next generation 
from this expanded population. Denoting the expanded population by $\Psi_e$, the 
selection probability of the $i$th individual, $p_i$, is given by 

$$p_i = \frac{P_i}{\sum_{j \in \Psi_e} P_j} \qquad (5)$$ 



The algorithm then holds selection trials for each “slot” in the new population, 
for a total of $|\Psi|$ trials. Since selection is with replacement, the constitution 
of the new population is governed by the multinomial distribution, and the copy 
number of a particular individual, $N(i)$, is distributed binomially: 

$$P(N(i) = r) = \binom{|\Psi|}{r} p_i^r (1 - p_i)^{|\Psi| - r} \qquad (6)$$ 

and so the expectation of $N(i)$ is $E[N(i)] = |\Psi| p_i$ and its variance is 
$\mathrm{Var}[N(i)] = |\Psi| p_i (1 - p_i)$. 
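
The copy-number statistics of roulette selection can be checked numerically. The sketch below is a hedged illustration: the function names are hypothetical, `slots` stands for the number of selection trials, and the fitness values are arbitrary.

```python
import math


def selection_probabilities(fitness):
    """Roulette selection: probability proportional to fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def copy_number_pmf(p, slots, r):
    """Binomial probability that an individual with selection
    probability p receives exactly r copies in `slots` trials."""
    return math.comb(slots, r) * p**r * (1 - p) ** (slots - r)


def copy_number_stats(p, slots):
    """Expectation and variance of the copy number."""
    return slots * p, slots * p * (1 - p)


probs = selection_probabilities([1.0, 1.0, 2.0])
for p in probs:
    mean, var = copy_number_stats(p, slots=4)
    print(f"p = {p:.2f}  E[N] = {mean:.2f}  Var[N] = {var:.2f}")
```

With these values the fittest individual has an expected copy number of 2, while each of the other two has an expected copy number of 1 and a smaller variance.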

The search power of the standard genetic algorithm arises from the fact 
that if the individual in question is highly fit, $p_i$ will be much larger than the 
average, and hence the expectation will be that the copy number will increase. 
This approach has two disadvantages. The first is that for small populations, 
sampling errors may lead to copy numbers very much higher or lower than the 
expected values. This can lead to premature convergence of the algorithm to a 
local optimum. Baker proposed “stochastic remainder sampling”, which 
guarantees that the copy number will not be much different from the expectation 
by stipulating that $\lfloor E[N(i)] \rfloor \le N(i) \le \lceil E[N(i)] \rceil$. However, the larger the 
population, the less need there is for Baker’s algorithm. The second disadvantage 
is that less fit individuals have lower expectations, and that the lower the fitness, 
the lower the variance of the copy number. In other words, less fit individuals 
are increasingly likely to have lower copy numbers. When $E[N(i)]$ falls below 1, 
the individual will probably disappear from the population. In general, the copy 
number variance decreases with decreasing fitness. Only when $p_i > 0.5$ does the 
variance decrease with increasing fitness. This occurs when the fitness of one 
individual accounts for at least half the total fitness of the population, i.e. when 
it is at least $|\Psi_e| - 1$ times as fit as any other individual. 

In short, the problem with roulette selection is that it imposes too strict an 
acceptance criterion on individuals with below average fitness. Several alternative 
strategies have been proposed to avoid this problem. “Sigma truncation”, rank 
selection and tournament selection all seek to maintain constant selection 
pressure by requiring individuals not to compete on the basis of their fitness, but 
on some indirect figure of merit such as the rank of their fitness, or the distance 
between their fitness and the average in standard units. Taking rank selection as 
a typical example of these strategies, the selection probabilities are assigned by 
substituting the rank of the individual for its fitness in equation (5), with the best 
individual having the highest rank. The implication of this is that the expected 
copy numbers of the best and worst individuals are given by: 

$$E[N(\mathrm{best})] = \frac{2 |\Psi|}{|\Psi_e| + 1}$$ 

$$E[N(\mathrm{worst})] = \frac{2 |\Psi|}{|\Psi_e| (|\Psi_e| + 1)}$$ 

So, the expected copy number of the fittest individual differs from that of 
the least fit by a factor of $|\Psi_e|$. Moreover, if $|\Psi_e|$ is even moderately large, 
$E[N(\mathrm{worst})]$ will be much less than 1. Indeed, $E[N(i)]$ will be less than 1 for 
about half the population. Thus, under rank selection, less fit individuals are 
highly likely to disappear, even if they are quite good. 
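
Substituting ranks for fitness values in the roulette selection probabilities gives expected copy numbers that depend only on the population sizes. The following is a minimal sketch, assuming ranks run from 1 (worst) to the size of the expanded population (best) and that one selection trial is held per slot; the function name is hypothetical.

```python
def rank_expected_copies(expanded_size, slots):
    """Expected copy number for each rank, from worst (rank 1) to best."""
    rank_sum = expanded_size * (expanded_size + 1) // 2
    return [slots * rank / rank_sum for rank in range(1, expanded_size + 1)]


# Expanded population three times the size of the new population.
expected = rank_expected_copies(expanded_size=30, slots=10)
print(f"worst: {expected[0]:.4f}  best: {expected[-1]:.4f}")
```

In this setting the best individual is expected to receive 30 times as many copies as the worst, yet even the best has an expected copy number below 1.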

A second alternative to roulette selection is Boltzmann selection [14, 15]. This 
strategy borrows the idea from simulated annealing that, at thermal equilibrium, 
the probability of a system being in a particular state depends on the 
temperature and the system’s energy. The idea is that as the temperature is lowered, 
high energy (low fitness) states are less likely. The difficulty with this analogy 
is that it requires the system to have reached thermal equilibrium. In simulated 
annealing, this is achieved after very many updates at a particular temperature. 
However, in a genetic algorithm this would require many iterations at each 
temperature level to achieve equilibrium, coupled with a slow “cooling”. Within 
the 10 or so iterations allowed for hybrid genetic algorithms, equilibrium cannot 
even be attained, let alone annealing occur. 
It would appear, then, that there is a tradeoff between premature convergence 
and the strength of the selection operator. The problem arises from the fact that 
expected copy numbers of fit individuals may be greater than one, while those 
of unfit individuals may be less than one. One way of preventing the increase in 
copy number of highly fit individuals is to use “truncation selection”, as used 
in Rechenberg and Schwefel’s evolution strategies [16, 17]. Truncation selection 
would simply take the best $|\Psi|$ individuals from the expanded population, $\Psi_e$, 
to form the new population. The copy number of each individual is simply 1 or 
0, depending on its rank. Although no individual may increase its copy number, 
the selection pressure might still be quite severe, since for the algorithm used in 
this paper, $|\Psi_e|$ can be as large as $3|\Psi|$. In other words, less fit individuals still 
disappear at an early stage. The fact that individuals never increase their copy 
number makes this a relatively weak search operator, and probably unsuitable 
for a standard genetic algorithm. However, the gradient ascent step is itself a 
powerful optimiser, and may be mostly responsible for the optimisation 
performance of the algorithm. If this is so, selection would be a much less important 
search operator for this hybrid algorithm than it is for standard genetic 
algorithms. It may therefore be beneficial to trade search performance for greater 
diversity. 

3.1 Neutral Selection 

The benefits of stochastic selection can be combined with the evenness of trun- 
cation selection by selecting without replacement. This strategy can be called 
“biased selection without replacement” , since it is biased first in favour of fitter 
individuals, although it may also favour less fit ones. 

The alternative is to abandon fitness based selection altogether, and rely on
the local search step to do all the optimisation. If the genetic algorithm’s role is
explicitly limited to assembling a good initial guess for the local search operator,
the selection probabilities can be assigned uniformly, i.e. every individual in
$\Psi_e$ is selected with probability $1/|\Psi_e|$. This operator is called “neutral
selection”. Neutral selection without replacement can be implemented very
efficiently by shuffling $\Psi_e$ and choosing the “top” $|\Psi|$ individuals. This
strategy shares with truncation selection the advantage that the minimum number of
individuals are excluded from the new population, but it also maintains the global
stochastic acceptance properties of standard selection operators.
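
The contrast between truncation selection and neutral selection without replacement can be sketched as follows. The function names, the integer stand-ins for individuals and the `fitness` callable are illustrative assumptions.

```python
import random


def truncation_selection(expanded, fitness, size):
    """Keep the `size` fittest individuals; each survives at most once."""
    return sorted(expanded, key=fitness, reverse=True)[:size]


def neutral_selection(expanded, size, rng=random):
    """Uniform selection without replacement: shuffle, then take the top `size`."""
    shuffled = list(expanded)  # copy so the caller's population is untouched
    rng.shuffle(shuffled)
    return shuffled[:size]


expanded = list(range(12))  # stand-in for the expanded population
print(truncation_selection(expanded, fitness=lambda x: x, size=4))
print(neutral_selection(expanded, size=4, rng=random.Random(0)))
```

Both operators exclude exactly the same number of individuals, but only the neutral variant gives every individual the same chance of surviving.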

3.2 Elitism 

Elitist selection guarantees that at least one copy of the best individual so far
found is selected for the new population. This heuristic is very widely used in
genetic algorithms. Rudolph showed that the algorithm’s eventual convergence
cannot be guaranteed without it. The elitist heuristic can be modified
in two ways to help maintain diversity. First, it seems natural that if the goal
is to simultaneously obtain several solutions to the problem in hand, several of
the fittest individuals should be guaranteed in this way. This is called “multiple
elitism”. Second, if one wishes to avoid losing too many unfit individuals, the
worst individual can also be granted free passage to the new population. This
is called “anti-elitism”. These heuristics, together with the selection strategies
discussed earlier, are evaluated at the end of Section 4.
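
A minimal sketch of how multiple elitism and anti-elitism might be combined with a selection operator is given below. Filling the remaining places by neutral selection, and all names and parameters, are assumptions made for illustration only.

```python
import random


def select_with_elitism(expanded, fitness, size, n_elite=1, anti_elite=False,
                        rng=random):
    """Guarantee the n_elite fittest (and optionally the worst) individuals,
    then fill the remaining places by shuffling (an assumed choice)."""
    ranked = sorted(expanded, key=fitness, reverse=True)
    keep = ranked[:n_elite]          # multiple elitism when n_elite > 1
    rest = ranked[n_elite:]
    if anti_elite and rest:
        keep.append(rest.pop())      # the least fit individual gets free passage
    rng.shuffle(rest)
    return keep + rest[:max(0, size - len(keep))]


expanded = list(range(12))           # stand-in individuals; fitness = value
new_pop = select_with_elitism(expanded, fitness=lambda x: x, size=5,
                              n_elite=2, anti_elite=True,
                              rng=random.Random(1))
print(new_pop)
```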

4 Experiments 

This experimental study establishes the suitability of the hybrid genetic
algorithm for ambiguous graph matching, and compares the selection strategies
discussed in the previous section. The algorithm was tested on 30-node synthetic
graphs. Data graphs were generated by randomly perturbing the node attributes,
and then duplicating 10% of the nodes and perturbing their attributes. The
intention was to simulate segmentation errors expected of region extraction, such
as the splitting of one region into two similar ones.
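
The data generation step can be sketched as follows, for node attributes only. The Gaussian noise model, its standard deviation `sigma`, the attribute vectors and the function name are all assumptions; the handling of edges is not described here and is omitted.

```python
import random


def make_data_graph(model_attrs, dup_fraction=0.10, sigma=0.05, seed=0):
    """Return (origin_index, attributes) pairs for a perturbed data graph.

    Assumptions: attributes are lists of floats and the perturbation is
    additive Gaussian noise with standard deviation sigma.
    """
    rng = random.Random(seed)

    def perturb(vec):
        return [x + rng.gauss(0.0, sigma) for x in vec]

    # Step 1: perturb the attributes of every model node.
    data = [(i, perturb(v)) for i, v in enumerate(model_attrs)]
    # Step 2: duplicate a fraction of the nodes and perturb the copies again.
    n_dup = round(dup_fraction * len(model_attrs))
    for i in rng.sample(range(len(model_attrs)), n_dup):
        data.append((i, perturb(data[i][1])))
    return data


model = [[float(i), float(i)] for i in range(30)]  # toy 30-node attribute set
data = make_data_graph(model)
print(len(data))
```

Each duplicated node keeps the index of the model node it came from, so more than one data node maps to the same model node, which is the ambiguity the matcher has to cope with.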

4.1 Comparative Study 

A comparative study was performed to determine the best algorithm for
ambiguous matching. The algorithms used were the hybrid genetic algorithm with
and without mutation, crossover or both (hGA, hGA-m, hGA-x and hGA-xm)^,
a hybrid version of Eshelman’s CHC algorithm (hCHC), and plain gradient
ascent (HC). The experimental conditions are summarised in Table 2.

Table 2. Algorithms for Graph Matching. Each algorithm, apart from HC, made
approximately 700,000 fitness evaluations. Abbreviations: hGA = hybrid genetic
algorithm, hGA-m = hGA without mutation, hGA-x = hGA without crossover, hGA-xm
= hGA with neither mutation nor crossover, hCHC = hybrid CHC, and HC = gradient
ascent (hillclimbing).

             hGA      hGA-m    hGA-x    hGA-xm   hCHC     HC
Population   50       50       120      120      100      1
Iterations   5        5        5        5        5        10
Crossover    Uniform  Uniform  Uniform  Uniform  HUX      n/a
Cross rate   0.9      0.9      0.0      0.0      1.0      n/a
Mutate rate  0.3      0.0      0.3      0.0      0.35     n/a



Each of the algorithms listed in Table 2, except HC, was run 100 times. Since
HC is deterministic, it was only run once per graph. The results for the different
graphs were pooled to give 400 observations per algorithm. Algorithm
performance was assessed according to two criteria. The first was the average
fraction of correct mappings in the final population. The second was the proportion
of distinct individuals in the final population with more than 95% correct mappings.
The results are reported in Table 3.
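
The two criteria can be sketched as follows. The sketch assumes each individual is a tuple assigning a model node to every data node, and that the proportion of distinct individuals is taken relative to the population size; both are illustrative assumptions.

```python
def fraction_correct(individual, truth):
    """Fraction of mappings in one individual that agree with the ground truth."""
    return sum(a == b for a, b in zip(individual, truth)) / len(truth)


def evaluate_population(population, truth, threshold=0.95):
    """Return (average fraction correct, proportion of distinct good individuals)."""
    scores = [fraction_correct(ind, truth) for ind in population]
    average = sum(scores) / len(population)
    # Count each good solution once, however many copies the population holds.
    distinct_good = {tuple(ind) for ind, s in zip(population, scores)
                     if s > threshold}
    return average, len(distinct_good) / len(population)


truth = tuple(range(40))
one_error = (1,) + truth[1:]                       # a single wrong mapping
all_wrong = tuple((i + 1) % 40 for i in range(40))
population = [truth, truth, one_error, all_wrong]
average, distinct = evaluate_population(population, truth)
print(average, distinct)
```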

^ These should be regarded as different algorithms, not merely different parameter sets 
for a genetic algorithm, because a genetic algorithm with no crossover or mutation 
is fundamentally different from one which has these operators. For example, the 
hGA-xm algorithm is really just multiple restarts of gradient ascent with a selection 
step. 
Table 3. Graph Matching Results. Standard errors are given in parentheses.
Abbreviations: hGA = hybrid genetic algorithm, hGA-m = hGA without mutation,
hGA-x = hGA without crossover, hGA-xm = hGA with neither mutation nor crossover,
hCHC = hybrid CHC, and HC = gradient ascent (hillclimbing).

Algorithm   Average Fraction Correct   Average Fraction Distinct
hGA         0.90 (0.0044)              0.078 (0.0019)
hGA-m       0.88 (0.0051)              0.040 (0.0012)
hGA-x       0.84 (0.0052)              0.044 (0.00094)
hGA-xm      0.76 (0.0068)              0.013 (0.00036)
hCHC        0.92 (0.0042)              0.012 (0.00033)
HC          0.97 (n/a)                 n/a




At first sight, pure gradient ascent appears to outperform all the other algo- 
rithms. The reason for this is partly that the gradient ascent algorithm starts 
from an initial guess in which about 50% of the mappings are correct, whe- 
reas the other algorithms start with random initial guesses. More importantly, 
the final population of a genetic algorithm typically contains solutions much 
better and worse than the average. Thus, this comparison is not really fair: a 
fairer comparison of optimisation performance comes from considering hGA-xm, 
which is multiple random restarts of gradient ascent. Furthermore, gradient as- 
cent is deterministic, and therefore always gives the same result, but the genetic 
algorithm is stochastic and may do significantly better or worse than gradient 
ascent. Indeed, the genetic algorithm occasionally found matches with 100% cor- 
rect mappings. However, the performance of gradient ascent alone suggests that 
for unambiguous problems, genetic algorithms may not necessarily be the me- 
thod of choice. Apart from pure gradient ascent, the best optimiser was hCHC, 
which is only slightly better than hGA. The results for hGA-m and hGA-x indi- 
cate that crossover and mutation are playing an active part in the optimisation 
process. Turning to the fraction of distinct individuals with over 95% correct 
mappings, it is clear that pure gradient ascent is incapable of finding more than 
one solution. The hCHC algorithm appears to converge to fewer solutions than
the hGA algorithm. In all, the hybrid genetic algorithm (hGA) combines strong 
optimisation performance with the highest solution yield, and it is this algorithm 
which will be the subject of the remainder of this study. 

4.2 Selection 

Two sets of experiments were conducted to evaluate different selection strategies 
with and without elitism. In each case, a hybrid genetic algorithm was used, with 
a population size of 20, and uniform crossover was used at a rate of 1.0. The 
mutation rate was fixed at 0.4. The first set of experiments used 20, 30, 40 and 50 
node graphs, and for these the population size was set to 10, and the algorithm 
run for 5 iterations. The second set of experiments used four 30 node graphs, 
with a population size of 20 and 10 iterations. Five different selection strategies 
were compared: they were standard roulette, rank, and truncation selection, and 
neutral and biased selection without replacement. Five combinations of elitist
heuristics were considered: they were no elitism, single elitism, multiple elitism, 
anti-elitism, and a combination of multiple and anti-elitism. The experimental 
design was therefore a 5x5x4 factorial with 100 cells. The first set of experiments 
had 40 replications for a total of 4000 observations; and the second set had 50 
replications for 5000 observations. The accompanying figures summarise the results.
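
The size of the factorial design can be checked with a short enumeration. The factor labels below are shorthand introduced for this sketch.

```python
from itertools import product

selection = ["roulette", "rank", "truncation", "neutral", "biased"]
elitism = ["none", "single", "multiple", "anti", "multiple+anti"]
graphs = ["graph 1", "graph 2", "graph 3", "graph 4"]

# One cell per (selection strategy, elitism heuristic, graph) combination.
cells = list(product(selection, elitism, graphs))
print(len(cells))                          # number of cells in the design
print(len(cells) * 40, len(cells) * 50)    # observations in each experiment set
```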

Both plots show that neutral selection without replacement produced the 
best yields, and that truncation selection produced the worst. Biased and rou- 
lette selection strategies gave similar results, and were both outperformed by 
rank selection. Linear logistic regression analysis of both data sets confirmed 
this ranking of selection strategies. The results for elitism heuristic were not so 
convincing. It is questionable whether elitism has any overall effect: the regres- 
sion analysis of the second data set found no significant effect of varying the 
elitism strategy. The analysis of the first data set did show that either standard 
(single) or multiple elitism gave significantly better yields, but that the effect 
was small. 




Fig. 1. Average Yields versus Selection and Elitism. Data from all four graphs
have been pooled.



5 Conclusion 

This paper has presented a method of matching ambiguous feature sets with a 
hybrid genetic algorithm, which does not require any additional parameters to 
achieve multimodal optimisation. The first contribution made was to develop 
an attribute process for ambiguous feature measurements. The second contribu- 
tion has been to explore the hybrid genetic algorithm as a suitable optimisation 
framework for ambiguous graph matching. If most of the optimisation is un- 
dertaken in the gradient ascent step, the tradeoff between effective search and 
maintenance of diversity, which must be made in choosing a selection operator 
for standard genetic algorithms, can be abandoned. Neutral selection without 
replacement maximises the diversity in the next generation with no regard to 
individuals’ fitnesses. 
Abstract. Our research focuses on Chinese online ink matching, which tries to
match handwritten annotations with handwritten queries without attempting to
recognize them. Previously, we proposed a semantic matching scheme that uses
elastic matching with a dynamic programming approach based on the radical
model of Chinese characters. By means of semantic matching, a handwritten
annotation may also be retrieved independently of writers via a typed text query,
or stored texts can be retrieved by handwritten queries. This work is concerned
with the behavior of the previously proposed radical model in several aspects,
including character normalization, stroke segmentation, structural information,
and dynamic programming costs and schemes. Based on our study, a new radical
model is proposed. As a result, the recall of retrieval by handwritten query
reaches 90% for the first hit (an improvement of 20% over previous results) and
the recall by text query reaches 80% when the top 20 matches are returned.



1 Introduction and Motivation 

In language computing, both on-line and off-line handwritten Chinese character
recognition (HCCR) have been studied for several decades. Although online
recognition has an advantage over offline recognition because the temporal order
of the input points and strokes is available, it has still proved to be a more
difficult problem than most people anticipated, because of the variation in the
way people write and the complex training process involved [1]. In addition, a
large lexicon has to be incorporated because of the large number of characters
(3,000 - 5,000) that are in daily use.

Instead of handwriting recognition, some research work has been conducted on 
online ink matching that tries to match a handwritten query against raw ink data 
without attempting to recognize them [4]. This technique can be used in a document 
annotating and browsing system, which enables users to search their personal notes by 
a handwritten query. Similar work and various applications also appear elsewhere 
[6,7]. 

Recently, a semantic matching method was proposed by Ma et al. [5]. By
extending Wang’s Learning by Knowledge paradigm [8], this method focuses on the
semantic approach by which a human learns and recognizes things, and realizes
this approach in the matching of Chinese handwritten annotations via a radical
model. Semantic matching has several advantages over previous ink matching
methods [4]. First, it speeds up the existing ink matching by reducing the size of
the problem. For each query, it returns only the top candidates based on the
matching of radicals that are extracted from handwritten annotations. The
traditional raw ink matching is therefore applied only to these top candidates
instead of the entire database. Second, only a few radicals are used, so the
training process is minimized. Third, it enables user-independent retrieval
without handwriting recognition. After radicals have been obtained from the raw
data strings of one user, another user can type in the query by keyboard, and the
query can be converted to radical codes immediately.

^ Corresponding author. E-mail: mma@research.panasonic.com

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 407-416, 2000.
© Springer-Verlag Berlin Heidelberg 2000

408 M. Ma, C. Zhang, and P. Wang

As reported in [5], the incorporation of a semantic model speeds up the matching
process significantly. This is done by returning the top 30 (out of 200) candidates
in our experiments, which yields a reduction of 80% in computation time. The
drawback of semantic matching, however, is that its recall is lower than that of
the original raw ink matching because of the low accuracy of radical extraction.
The performance of radical extraction also affects the overall recall of retrieving
handwritten annotations by typed text queries.

This work further studies the behavior of the semantic model in order to improve
the online Chinese ink matching results. The study resulted in a new radical
model for the matching of Chinese handwritten annotations. The organization of
this paper is as follows. Section 2 describes several aspects of structural
information in the radical model and the incorporation of the new model in radical
extraction. Experiments using our new radical model on handwritten annotation
retrieval are described in Section 3. Finally, conclusions are given in Section 4.



2 Studies on Radical Model 

The radical model for the Chinese language is used to identify known radicals in
each handwritten character and to use these extracted radicals in the retrieval of
handwritten annotations. This is called radical extraction. The drawback of the
previous radical extraction is its performance. In particular, traditional dynamic
programming was used without taking into account the characteristics of the
Chinese language. In this section, some new aspects of the Chinese radical model
are presented in order to improve the radical extraction performance.



2.1 Character Normalization and Segmentation 

In this work, we employed some normalization and segmentation techniques, and
experiments show they are adequate.

1) Character size normalization may be possible once characters are successfully
segmented. For simplicity, a linear normalization is used.

2) The incoming points, which are usually grouped into strokes based on the
online “pen-down” and “pen-up” information, can be further segmented at local
minima and maxima of the y values and local minima of the x values. We call these
breaking points “internal breaking points”.

3) Each internal breaking point is then classified as “obscure” or “obvious”,
depending on the degree of stroke change near it. If the stroke changes relatively
smoothly around the internal breaking point, the point is considered “obscure”
and is eliminated.

4) In cursive handwriting, two separate strokes are sometimes connected by an
extra stroke, i.e. a connection stroke. These extra connections are not random;
they are limited to several types. In practice, a connection stroke may not appear
consistently in a handwritten character. It is more likely to be affected by the
speed and direction of the stylus when the character was formed. Therefore,
removing the extra connection stroke may reduce its effect on the matching
between two characters, one with connection strokes and the other without [3].
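
Steps 1 and 2 can be sketched as follows. The sketch assumes strokes are lists of (x, y) tuples, that linear normalization preserves the aspect ratio, and that extrema are detected by strict comparison with the two neighbouring points; the function names are hypothetical, and the obscure/obvious test of step 3 is not attempted.

```python
def normalize(points, size=1.0):
    """Linear size normalization: shift to the origin and scale uniformly."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    extent = max(max(xs) - min_x, max(ys) - min_y, 1e-12)  # avoid dividing by 0
    scale = size / extent
    return [((x - min_x) * scale, (y - min_y) * scale) for x, y in points]


def internal_breaking_points(stroke):
    """Indices of local y minima/maxima and local x minima inside a stroke."""
    breaks = []
    for k in range(1, len(stroke) - 1):
        (x0, y0), (x1, y1), (x2, y2) = stroke[k - 1], stroke[k], stroke[k + 1]
        y_extremum = (y1 < y0 and y1 < y2) or (y1 > y0 and y1 > y2)
        x_minimum = x1 < x0 and x1 < x2
        if y_extremum or x_minimum:
            breaks.append(k)
    return breaks


def split_stroke(stroke):
    """Cut a stroke at its internal breaking points (shared end points)."""
    cuts = [0] + internal_breaking_points(stroke) + [len(stroke) - 1]
    return [stroke[a:b + 1] for a, b in zip(cuts, cuts[1:])]


zigzag = [(0, 0), (1, 2), (2, 0), (3, 2)]
print(internal_breaking_points(zigzag))
print(split_stroke(zigzag))
```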



2.2 Shape Measurement 

Consider dynamic programming at the stroke level. Let $C = (c_1, c_2, \ldots)$ and
$R = (r_1, r_2, \ldots)$ be the stroke sequences for a character and a radical,
respectively. The problem of radical extraction is to apply a series of operations
to sequence R, from left to right, and transform it into a subsequence of C. This
can be realized by a dynamic programming procedure, in which three basic
operations on strokes are defined: (a) insert a stroke, (b) delete a stroke, and
(c) substitute a stroke for another. Each operation is associated with a cost. The
details of the dynamic programming are described elsewhere [4].
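
The three operations can be illustrated with a generic edit-distance recurrence. This sketch transforms the whole of R into the whole of C and is not the exact procedure of [4], which matches R against a subsequence of C; the cost functions are supplied by the caller and the toy strokes are plain numbers standing for stroke lengths.

```python
def edit_cost(radical, character, ins_cost, del_cost, sub_cost):
    """Minimum total cost of turning `radical` into `character` stroke by stroke."""
    m, n = len(radical), len(character)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):                      # delete every radical stroke
        d[i][0] = d[i - 1][0] + del_cost(radical[i - 1])
    for j in range(1, n + 1):                      # insert every character stroke
        d[0][j] = d[0][j - 1] + ins_cost(character[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(radical[i - 1]),
                d[i][j - 1] + ins_cost(character[j - 1]),
                d[i - 1][j - 1] + sub_cost(radical[i - 1], character[j - 1]),
            )
    return d[m][n]


# Toy strokes: each stroke is represented only by its length.
radical = [1.0, 2.0]
character = [1.0, 2.0, 3.0]
cost = edit_cost(radical, character,
                 ins_cost=lambda s: s,             # proportional to stroke length
                 del_cost=lambda s: s,
                 sub_cost=lambda a, b: abs(a - b))
print(cost)
```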

In the previous implementation, the stroke insertion cost and the stroke deletion
cost are simply in direct proportion to the length of the stroke. As for the stroke
substitution cost, corresponding points between two strokes are located using a
separate dynamic programming procedure at the point level, and the Euclidean
distance between each pair of points is measured and summed. This method has two
disadvantages. First, the dynamic programming at the point level is time
consuming. Second, the Euclidean distances between points can be cumulative.

Ideally, the stroke substitution cost measures the difference between two strokes
or, more precisely, the difference between their shapes. However, discrepancies
exist in the current computation scheme. For example, as illustrated in Fig. 1,
$s_1$ is the reference stroke, while $s_2$ is the stroke to be compared to $s_1$.
In the original algorithm, before the substitution cost is computed, each stroke is
temporarily shifted so that the top-left corners of the bounding boxes of the
strokes are aligned (Fig. 1c). Although $s_1$ and $s_2$ are similar in overall
shape except for the beginning part, they still yield a large Euclidean distance
because of the deviation of the beginning part. Therefore, another method for
measuring the shape similarity of two strokes, based on tangent vectors, is
proposed.

Fig. 1. Discrepancies of substitution cost based on Euclidean distance.
(a) Stroke $s_1$. (b) Stroke $s_2$.

The tangent vector at a point of a stroke is defined as the vector from the current
point to its next point along the stroke. Referring to Fig. 2, we define the
corresponding points of two strokes as follows. Let $s_1$ be a stroke with $l_1$
points, and $s_2$ be a stroke with $l_2$ points. If $P_i$ is the ith point within
$s_1$, the corresponding point of $P_i$ on stroke $s_2$ is $P_j$, where
$j = (i / l_1)\, l_2$. The substitution cost of two corresponding points,
point_sub_cost($P_i$, $P_j$), is calculated from the tangent vectors at the two
points, where $v_i$ is the tangent vector at point $P_i$ and $v_j$ is the tangent
vector at $P_j$; $\theta(v_i, v_j)$ is the angle between the two vectors, and
$\theta \in [0, \pi]$. By summing the point substitution costs over all the points
along the stroke $s_1$, we obtain stroke_sub_cost($s_1$, $s_2$), the substitution
cost between strokes $s_1$ and $s_2$, from the sum

$$\sum_{i=1}^{l_1} \mathrm{point\_sub\_cost}(P_i, P_j)$$

and $L_1$, the length of stroke $s_1$. After further normalization, which scales
the sum by a max(·) term taken over the two strokes, we should approximately have

$$\mathrm{stroke\_sub\_cost}(s_1, s_2) \approx \mathrm{stroke\_sub\_cost}(s_2, s_1)$$

As can be seen, the new stroke substitution cost can overcome the two
disadvantages mentioned earlier. First, by finding the corresponding points
directly, we can eliminate the dynamic programming procedure for finding the pairs
of corresponding points. Second, the calculation of the substitution cost using
tangent vectors does not have cumulative effects; therefore, it is a more accurate
shape measurement.
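
The idea of a tangent-based stroke cost can be sketched as follows. The per-point cost used here (the angle divided by π), the averaging over the points of the first stroke, the rounding of the corresponding index and the reuse of the last tangent at the final point are all assumptions made for illustration; they are not the cost formulas of this paper.

```python
import math


def tangents(stroke):
    """Tangent at each point: vector to the next point (needs >= 2 points)."""
    vecs = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(stroke, stroke[1:])]
    return vecs + vecs[-1:]  # assumption: the last point reuses the last tangent


def angle(u, v):
    """Angle between two vectors, in [0, pi]; zero vectors count as aligned."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    cosine = (u[0] * v[0] + u[1] * v[1]) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cosine)))


def stroke_sub_cost(s1, s2):
    """Mean normalized tangent angle between corresponding points of s1 and s2."""
    t1, t2 = tangents(s1), tangents(s2)
    l1, l2 = len(s1), len(s2)
    total = 0.0
    for i in range(1, l1 + 1):
        j = max(1, min(l2, round(i * l2 / l1)))  # corresponding point on s2
        total += angle(t1[i - 1], t2[j - 1]) / math.pi
    return total / l1


line = [(0, 0), (1, 0), (2, 0)]
shifted = [(5, 7), (6, 7), (7, 7)]      # same shape, different position
reversed_line = line[::-1]              # same points, opposite direction
print(stroke_sub_cost(line, shifted), stroke_sub_cost(line, reversed_line))
```

Because only directions are compared, a stroke that is merely translated costs nothing, and a local deviation affects only the points where it occurs instead of accumulating along the stroke.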
2.3 Structural Information

To form a Chinese character, the strokes within the character are arranged with
certain structural relationships (i.e. spatial relationships among strokes). Given
only the stroke sequence of a character, without the spatial relationships between
the strokes, the character cannot be determined. In this section, the stroke
structural relationships embedded in the Chinese language are studied.

2.3.1 Center Relationships



Fig. 3. Illustration of center relationships. $r_i$ and $c_j$ have been matched;
$r_m$ and $c_n$ are under consideration for matching.



The weighted center of a stroke can be used to indicate the position of a stroke.
The structural information can be reflected by the spatial relationship between
two stroke centers. Referring to Fig. 3, let the last two matched (substituted)
strokes be $r_i$ (the ith stroke of the reference radical) and $c_j$ (the jth
stroke of the character). The strokes currently under consideration for matching
are $r_m$ and $c_n$. Let $p_i$, $p_j$, $p_m$ and $p_n$ be the weighted centers of
$r_i$, $c_j$, $r_m$ and $c_n$, respectively. The vector $\overrightarrow{p_i p_m}$
reflects the spatial relationship between the two strokes $r_i$ and $r_m$.
Similarly, the vector $\overrightarrow{p_j p_n}$ reflects the spatial relationship
between $c_j$ and $c_n$. Before two strokes $r_m$ and $c_n$ are considered for
matching, their spatial relationship with the last two matched strokes is
examined:

Rule 1: If $\theta(\overrightarrow{p_i p_m}, \overrightarrow{p_j p_n}) > \theta_T$,
$r_m$ and $c_n$ will not be considered for matching, where $\theta_T$ is a
threshold, currently set to $\pi/2$.
2.3.2 Starting Point and Ending Point

Fig. 4. Illustration of starting and ending point relationships. $r_i$ and $c_j$
have been matched; $r_m$ and $c_n$ are under consideration for matching.



Another important feature that reflects the structure of strokes is the
relationship between the ending point of a stroke and the starting point of the
next stroke. As illustrated in Fig. 4, let $e_i$ and $e_j$ be the ending points of
the strokes $r_i$ and $c_j$, respectively. Let $s_m$ and $s_n$ be the starting
points of the strokes $r_m$ and $c_n$, respectively. Our criterion is:

Rule 2: If $\theta(\overrightarrow{e_i s_m}, \overrightarrow{e_j s_n}) > \theta_T$,
then $r_m$ and $c_n$ will not be considered for matching,

where $\theta_T$ is a threshold, currently set to $\pi/2$. Sometimes, when two
consecutive strokes are connected, the ending point of the first stroke happens to
be the starting point of the second stroke, i.e. $e_i = s_m$ or $e_j = s_n$. In
this case, the above criterion is ignored and the substitution cost for matching
is calculated.
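
Both structural rules reduce to an angle test between two displacement vectors, which can be sketched as follows. The sketch assumes strokes are lists of (x, y) tuples and takes the weighted center to be the plain mean of the sample points; the function names are hypothetical.

```python
import math


def centroid(stroke):
    """Assumed form of the weighted center: mean of the stroke's points."""
    n = len(stroke)
    return (sum(x for x, _ in stroke) / n, sum(y for _, y in stroke) / n)


def vector(a, b):
    return (b[0] - a[0], b[1] - a[1])


def angle(u, v):
    """Angle between two vectors, in [0, pi]; zero vectors count as aligned."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    cosine = (u[0] * v[0] + u[1] * v[1]) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cosine)))


def rule1_allows(r_i, r_m, c_j, c_n, threshold=math.pi / 2):
    """Rule 1: compare the displacements between stroke centers."""
    p_i, p_m, p_j, p_n = (centroid(s) for s in (r_i, r_m, c_j, c_n))
    return angle(vector(p_i, p_m), vector(p_j, p_n)) <= threshold


def rule2_allows(r_i, r_m, c_j, c_n, threshold=math.pi / 2):
    """Rule 2: compare end-to-start displacements; skipped for connected strokes."""
    e_i, s_m, e_j, s_n = r_i[-1], r_m[0], c_j[-1], c_n[0]
    if e_i == s_m or e_j == s_n:
        return True  # connected strokes: the rule is ignored
    return angle(vector(e_i, s_m), vector(e_j, s_n)) <= threshold


r_i = [(0, 0), (2, 0)]
r_m = [(0, 2), (2, 2)]            # next radical stroke lies below the first
c_j = [(0, 0), (2, 0)]
c_below = [(0, 2), (2, 2)]        # consistent layout
c_above = [(3, -2), (5, -2)]      # inconsistent layout
print(rule1_allows(r_i, r_m, c_j, c_below), rule1_allows(r_i, r_m, c_j, c_above))
print(rule2_allows(r_i, r_m, c_j, c_below), rule2_allows(r_i, r_m, c_j, c_above))
```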



2.4 Categorization of Radicals 

In the Chinese language, the arrangement of radicals within a character is not
random. For example, Cheng et al. [1] classified the radical combinations into
seven categories, such as up-down (UD) and left-right (LR). According to the
analysis of Lin et al. [2], over 88% of frequently used Chinese characters belong
to the LR and UD types. Based on this, we categorize radicals into two main
categories. In the first category, radicals occupy the first several strokes of a
character, while in the second category, radicals occupy the last several strokes.

The category to which a radical belongs is usually known. This category
information can be built into our matching process so that a wrong match is given
a higher cost, which helps to prevent it. When a reference radical is matched to a
character, a penalty is added if the matched strokes within the character do not
fall into the expected category. This is implemented by adjusting the cost in the
dynamic programming procedure.

If a first category radical is matched to a character, but substitution does not
start from the first stroke of the character, all operation costs incurred before
the first substitution are also added to the total cost. Similarly, if a second
category radical is not matched to the last few strokes of a character, all
operation costs from the last substitution to the last stroke of the character are
added as a penalty.



2.5 Location Similarity 

By extending the concept of radical categorization, Ma et al. defined the radical
profile and location similarity to represent mathematically the location of a
radical within a character. In the radical extraction scheme, location similarity
gives imprecise information: it only coarsely confines the location of the
radicals [5]. Dynamic programming, however, provides more accurate information
about how well radical strokes are matched. We propose to use the coarse
information (location similarity) to sift out radicals and then use the accurate
information (dynamic programming cost) to select and extract the radicals. Once
radical candidates with negative location similarity are removed, the remaining
radicals are ranked according to their dynamic programming costs, and the two
radicals with the lowest costs are chosen as the extracted radicals.

In the previous algorithm, the total dynamic programming cost for matching a
reference radical to a part of a character is the sum of all operation costs
(insertion, deletion and substitution). Therefore, when all reference radicals are
matched against a character, the radicals with fewer strokes tend to yield smaller
dynamic programming costs. To solve this, we normalize the total dynamic
programming cost by the length of the reference radical.
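
The two-stage selection can be sketched as follows. The candidate record, its field names and the interpretation of the length of a reference radical as its number of strokes are assumptions made for this illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    radical_id: str
    location_similarity: float   # coarse information
    dp_cost: float               # total dynamic programming cost
    n_strokes: int               # assumed meaning of "length of reference radical"


def extract_radicals(candidates: List[Candidate], top_k: int = 2) -> List[str]:
    """Sift by location similarity, then rank by length-normalized DP cost."""
    kept = [c for c in candidates if c.location_similarity >= 0]
    ranked = sorted(kept, key=lambda c: c.dp_cost / c.n_strokes)
    return [c.radical_id for c in ranked[:top_k]]


candidates = [
    Candidate("A", 0.5, 6.0, 3),    # normalized cost 2.0
    Candidate("B", -0.1, 1.0, 2),   # removed: negative location similarity
    Candidate("C", 0.2, 3.0, 1),    # normalized cost 3.0
    Candidate("D", 0.9, 10.0, 8),   # normalized cost 1.25
]
print(extract_radicals(candidates))
```

Without the normalization, the single-stroke radical C would outrank the eight-stroke radical D simply because it accumulates fewer operation costs.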



2.6 Radical Code Evaluation 



After radicals are extracted for each character, a character can be represented by
a sequence of radical codes, i.e. radical IDs. When two characters are compared,
the matching is performed using dynamic programming at the level of radical codes.
Three basic operations are defined: radical insertion, radical deletion and
radical substitution, each associated with an operation cost. When evaluating
extracted radicals, we propose to use the radical extraction cost, which reflects
the level of trust in the extracted radicals. In the implementation, extracted
radicals with higher confidence (lower cost) yield a lower radical substitution
cost.



3 Experiments 

Our experimental data consist of three sets: 800 handwritten annotations as the
reference database (Set I), from four subjects each writing 200 entries; the same
800 annotations written a second time by the same four subjects as the handwritten
query database (Set II); and 800 typed text entries as the text query database
(Set III), one for each of the 800 handwritten annotations. The experiments are
conducted in three areas: radical extraction, which is the core of semantic
matching; the overall recall of searching handwritten annotations via handwritten
queries; and the overall recall of searching by typed text. Three methods are
tested and compared with the traditional raw ink elastic matching algorithm [4].
$A_0$ represents the original semantic matching algorithm [5]. $A_1$ incorporates
the new radical model in our study. $A_1'$ is almost the same as $A_1$, except
that the selection of top candidates returns the top 15 matches instead of the
top 30 matches.

Table 1 shows the total number of radicals correctly extracted from the query
strings and database strings for algorithms $A_0$ and $A_1$, based on the same
reference radical set. As can be seen, with the enhanced algorithm $A_1$ the
number of correctly extracted radicals has increased significantly compared with
the original algorithm. The performance gain is approximately 2 to 3 times.

Table 1. Comparison of radical extraction rate for algorithms $A_0$ and $A_1$.
Table 2 lists the recall of the first hits for searching handwriting (Set I) with
handwritten queries (Set II). As can be seen, the recall of our new algorithm
$A_1$ has improved by 20% for the first three users, and by 8% for the 4th user.
Moreover, the performance achieved by the new algorithm is very close to that of
the traditional elastic matching. Because $A_1$ returns only the top 30 candidates
for final matching, it achieves almost the same performance as the original
elastic matching while saving 80% of the computation time. In algorithm $A_1'$,
we further cut the computation time in half by returning only the top 15
candidates. As a result, the computation time is reduced by more than 90%
relative to the original elastic matching, while achieving comparable results. In
fact, it is interesting to see that for User2, the matching rate of $A_1$ is even
higher than that of the traditional elastic matching algorithm. The reason is that
some interfering candidates for the traditional elastic matching algorithm have
been removed in the top candidate selection process.



Table 2. Comparison of recall for first hits (searching handwriting with handwritten queries).

User1
0.91
0.88
User3
User4
0.805
0.865
0.855
0.88



Figure 5 shows the recall of searching by typed text queries. In this experiment,
each handwritten annotation (consisting of a sequence of characters) is converted
to a sequence of radical codes using radical extraction. When a text query is
entered, it is immediately converted to a radical code sequence and then compared
with the handwritten annotation database on the basis of radical codes. In Fig. 5,
the bottom curve in each plot indicates the previous semantic matching results,
while the top curve shows the $A_1$ results. As can be seen, the matching rate for
the first hit has increased by more than 100%. Overall, the retrieval rate reaches
60% with the top 10 matches returned and 80% with the top 20 matches returned.



Fig. 5. The recall of searching by typed text queries. Each panel plots ink vs.
typed text for one user. The bottom curves in each plot represent the result of
$A_0$ and the top curves represent the result of $A_1$.



4 Conclusions 

Radical extraction plays an important role in semantic matching, in which the
semantics of the Chinese language are incorporated early, into the segmentation of
handwritten annotations, and are later used for matching handwriting or for retrieving
handwriting by typed text queries. In this work, we carefully studied and modified
the radical model, on the basis of which the radical extraction rate has increased by
100% - 200%. Several other schemes in the semantic matching network have also been
enhanced. As a result, the recall of searching handwriting by handwritten queries has
increased by 20% and reached 90% for first hits, while the computation time can be
reduced by 50%. Moreover, the recall of searching by typed text queries has increased
by 100% for the first hit and reached about 80% for the top 20 matches returned.

The results of this work show the great potential of semantics in the matching of
Chinese handwritten annotations without full-blown handwriting recognition, for
which large-scale training is usually required. To conclude, Table 3 compares various
methods for searching Chinese annotations. The radical model studied in this work
may well be extended to other languages or symbols; this is left for future work.



Table 3. Comparison of handwriting matching methods. 





Method                         Speed       Performance   Handwriting   Text searchable
                                                         searchable    (user independent)
Traditional Elastic Matching   Very slow   Good          Yes           No
Previous Semantic Matching     Fast        Fair          Yes           Promising
New Semantic Matching          Very fast   Nearly good   Yes           Yes

Abstract. The stochastic extension of formal translations constitutes a
suitable framework for dealing with many problems in Syntactic Pattern
Recognition. Some estimation criteria have already been proposed
and developed for the parameter estimation of Regular Syntax-Directed
Translation Schemata. Here, a new criterion is proposed for dealing
with situations in which training data is sparse. This criterion is based on
entropy measurements, is inspired by the Maximum Mutual Information
criterion, and takes into account the possibility of ambiguity
in translations (i.e., the translation model may yield different output
strings for a single input string). The goal in the stochastic framework
is to find the most probable translation of a given input string.
Experiments were performed on a translation task which has a high degree of
ambiguity.

Keywords: Machine translation, stochastic finite-state transducers,
probabilistic estimation.



1 Introduction 

A translation is a process that maps strings from a given language (the input
language) into strings which belong to another language (the output language).
If both the input and the output languages are formal, then the formal devices
that implement such translations (formal translations) are known as formal
transducers and have been thoroughly studied in the theory of formal languages
[?]. Formal translations of many kinds were initially proposed for compiling
programming languages [?] and as a framework for a concise presentation of
error-correction models in syntactic pattern recognition [?].

Regular Translations constitute an important class of formal translations that
have recently become of great interest as a model in some practical Syntactic
Pattern Recognition problems in which the classification paradigm is not
adequate because the number of classes could be large or even infinite. In this case,
the most general paradigm of interpretation seems to be a better framework and
can be tackled through formal translations. For example, many tasks in
Automatic Speech Recognition can be viewed as simple translations from acoustic
sequences to sub-lexical or lexical sequences (Acoustic-Phonetic Decoding), or
from acoustic or lexical sequences to sequences of commands to a database
management system or to a robot (Semantic Decoding). A more complex application
along the same lines is the translation between natural languages (e.g., English to
Spanish) [?]. Formal transducers can be learned automatically from examples
[?]. This opens a wide field of applications based on the induction of
translation models from parallel corpora.

However, the application of formal transducers to Syntactic Pattern
Recognition needs a stochastic extension because noisy and distorted patterns
make the process of interpretation ambiguous. The statistical parameters of
the extended models define a probability distribution over the possible
translations that helps decide which is the best translation of a given input sentence.
A common way of setting these parameters is to learn them from examples of
translations.

A Maximum Likelihood algorithm for learning the statistical parameters of
Stochastic Regular Syntax-Directed Translation Schemata from examples has
recently been proposed [?]. This algorithm estimates the parameter set by
maximizing the likelihood of the training data over the model. The Maximum
Conditional Entropy estimation criterion presented in this paper (see
Section 2.1) is based on some ideas from Maximum Mutual Information (MMI)
[?] and can be particularly useful when the training data is sparse.

On the other hand, a learning algorithm based on the MMI criterion was
proposed in a previous paper [?]. We have found this criterion to be inadequate
for translation. A discussion of this matter is given in Section 2.1.

2 The Translation Schema 

Let $\Sigma$ be an input alphabet and $\Delta$ be an output alphabet. A Formal
Translation $T$ can be defined as a subset of $\Sigma^* \times \Delta^*$. Note that
naming $\Sigma$ and $\Delta$ as the input and output alphabets, respectively, is an
arbitrary decision. We will use the terms input and output whenever it helps to make
the presentation clearer.

A Regular Syntax-Directed Translation Schema (RT) is defined as a tuple
$T = (N, \Sigma, \Delta, R, S)$ in which $N$ is a finite set of non-terminal symbols,
$\Sigma$ and $\Delta$ are finite sets of input and output terminal symbols,
$S \in N$ is the initial symbol of the schema, and $R$ is a set of rules of the form
$A \to aB, zB$ or $A \to a, z$, where $A, B \in N$, $a \in \Sigma$ and
$z \in \Delta^*$.

A natural extension of the RT is given by the Stochastic Regular
Syntax-Directed Translation Schema (SRT). An SRT is a pair $(T, P)$, where $T$ is an
RT and $P : R \to\, ]0, 1]$ is a function that assigns probability¹ values to the
rules of the schema in such a way that the sum of probabilities of all rules rewriting a
non-terminal $A$ is equal to 1 (proper SRT [?]). Formally, for any non-terminal
$A_i$, the set $\{A_i \to \beta_1, A_i \to \beta_2, \ldots, A_i \to \beta_n\}$ of all
rules that rewrite $A_i$ must satisfy the following condition of stochasticity:

\[ \sum_{j=1}^{n} P(A_i \to \beta_j) = 1. \tag{1} \]

¹ For the sake of simplicity, in the remainder of the paper we will denote $Pr(X = x)$
as $Pr(x)$ and $Pr(Y = y \mid X = x)$ as $Pr(y \mid x)$, where $X$ and $Y$ are
stochastic variables and $x$ and $y$ are two possible values of $X$ and $Y$, whenever
the correct meaning can be deduced from context.
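As a minimal sketch of the stochasticity condition (1), the following Python fragment
stores a toy SRT as a list of rules and checks that the probabilities of the rules
rewriting each non-terminal sum to 1. The rule encoding `(A, a, z, B, p)`, the toy
rules and the name `is_proper` are illustrative assumptions and do not come from the
paper.

```python
from collections import defaultdict
from math import isclose

# Hypothetical toy SRT. Each rule is (A, a, z, B, p): A -> aB, zB with
# probability p. B is None for a final rule A -> a, z. The output z is a
# tuple of output symbols (possibly empty).
RULES = [
    ("S", "a", ("x",), "S", 0.5),
    ("S", "a", ("y",), "F", 0.3),
    ("S", "b", (), None, 0.2),
    ("F", "b", ("z",), None, 1.0),
]


def is_proper(rules, tol=1e-9):
    """Check condition (1): for every non-terminal, rule probabilities sum to 1."""
    totals = defaultdict(float)
    for lhs, _a, _z, _rhs, p in rules:
        if not 0.0 < p <= 1.0:  # P maps rules into ]0, 1]
            return False
        totals[lhs] += p
    return all(isclose(t, 1.0, abs_tol=tol) for t in totals.values())
```

With these toy rules, `is_proper(RULES)` holds, while dropping any rule that rewrites
`S` breaks the condition.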

A finite sequence of rules $tf = (r_1, r_2, \ldots, r_n)$ such that

\[ (S, S) \Rightarrow (x_1 A_1, y_1 A_1) \Rightarrow (x_1 x_2 A_2, y_1 y_2 A_2) \Rightarrow \cdots \Rightarrow (x, y) \]

is known as a translation form for the translation pair
$(x, y) \in \Sigma^* \times \Delta^*$. We will denote $x$ as $input(tf)$, and $y$ as
$output(tf)$.

Each translation form is given a probability in the model. This probability is
defined as the product of the probabilities of all rules that are used in the translation
form. That is, given an RT $T$, a set of probabilistic parameters $\Phi(T)$ given by
some function $P$, and a translation form $tf = (r_1, r_2, \ldots, r_n)$:

\[ Pr(tf \mid \Phi(T)) = P(r_1)\, P(r_2) \cdots P(r_n). \tag{2} \]

Note that a translation pair $(x, y) \in \Sigma^* \times \Delta^*$ may be derived in
$T$ through more than one translation form (there may exist $tf$, $tf'$ in $T$ such
that $input(tf) = input(tf') = x$, $output(tf) = output(tf') = y$ and
$tf \neq tf'$). Thus, the probability of a translation $(x, y)$ must be defined as the
sum of the probabilities of all translation forms that produce $(x, y)$:

\[ Pr(x, y \mid \Phi(T)) = \sum_{tf \,:\, input(tf) = x \,\wedge\, output(tf) = y} Pr(tf \mid \Phi(T)). \tag{3} \]



Given a sentence $x$ in the input language, how can we obtain a translation
of $x$ in the output language? Note that since the SRT can be ambiguous, a
single input sentence may be mapped by the model into more than one output
sentence. This is one of the reasons why we need a stochastic extension of the
models: we will use the statistical information in the model for deciding which
of the many possible translations of an input sentence is the best one. Following
this idea, we define the stochastic translation of an input string $x \in \Sigma^*$ in
an SRT $T$ as the string $y^* \in \Delta^*$ into which $x$ is translated with the
highest probability:

\[ y^* = \operatorname*{argmax}_{y \in \Delta^*} Pr(y \mid x, \Phi(T)), \tag{4} \]

where

\[ Pr(y \mid x, \Phi(T)) = \frac{Pr(x, y \mid \Phi(T))}{Pr(x \mid \Phi(T))}. \]

Given that the probability $Pr(x \mid \Phi(T))$ does not depend upon the maximization
index $y$, we can rewrite (4) as:

\[ y^* = \operatorname*{argmax}_{y \in \Delta^*} Pr(x, y \mid \Phi(T)). \tag{5} \]

Computing the stochastic translations of the input sentences is the proper
way to make a translation with an SRT. However, the calculation of (5) has
been demonstrated to be an NP-hard problem [?]. The only possible algorithmic
solution is the use of some variant of the A* algorithm (e.g., Stack-Decoding
[?]), which has exponential computational cost in the worst case and,
therefore, may not be feasible for some real applications.

A computationally cheaper approximation to the stochastic translation can
be defined. Instead of defining the probability of a translation as shown in (3), we
will work with the so-called Viterbi probability of a translation, written here as
$\widehat{Pr}$. This is defined as the probability of the translation form that most
probably yields $(x, y)$:

\[ \widehat{Pr}(x, y \mid \Phi(T)) = \max_{tf \,:\, input(tf) = x \,\wedge\, output(tf) = y} Pr(tf \mid \Phi(T)). \tag{6} \]

The approximate stochastic translation $y^{**}$ of an input sentence $x$ is
computed as an approximation to the stochastic translation defined in (5). Here, the
Viterbi probability is used instead of the standard probability:

\[ y^{**} = \operatorname*{argmax}_{y \in \Delta^*} \widehat{Pr}(x, y \mid \Phi(T)). \tag{7} \]

There exists a polynomial algorithm for calculating (7). This algorithm
searches for the maximum probability translation form $tf$ for a given input string
$x$ such that $input(tf) = x$.
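The search for the most probable translation form can be sketched as a small dynamic
program over the input string. The code below is only an illustration under stated
assumptions: the rule encoding `(A, a, z, B, p)`, the toy rules and the function name
`viterbi_translate` are hypothetical, and the sketch is not claimed to be the
algorithm referred to above. It relies on the fact that every rule of an RT consumes
exactly one input symbol.

```python
# Hypothetical toy SRT. Each rule is (A, a, z, B, p): A -> aB, zB with
# probability p. B is None for a final rule A -> a, z.
RULES = [
    ("S", "a", ("x",), "S", 0.5),
    ("S", "a", ("y",), "F", 0.3),
    ("S", "b", (), None, 0.2),
    ("F", "b", ("z",), None, 1.0),
]


def viterbi_translate(rules, start, x):
    """Return (probability, output) of the most probable translation form
    whose input is x, or None if x cannot be derived."""
    # best[A] = (probability, output so far) of the best partial translation
    # form that has consumed the symbols read so far and now rewrites A.
    best = {start: (1.0, ())}
    final = None
    for i, sym in enumerate(x):
        last = i == len(x) - 1
        nxt = {}
        for lhs, a, z, rhs, p in rules:
            if a != sym or lhs not in best:
                continue
            prob, out = best[lhs]
            cand = (prob * p, out + tuple(z))
            if rhs is None:
                # A final rule may only be applied to the last input symbol.
                if last and (final is None or cand[0] > final[0]):
                    final = cand
            elif not last and (rhs not in nxt or cand[0] > nxt[rhs][0]):
                nxt[rhs] = cand
        best = nxt
    return final
```

For the toy rules, the input `"ab"` has two translation forms (probabilities 0.1 and
0.3); the sketch returns the second one, whose output is `("y", "z")`. The cost is
linear in the length of the input times the number of rules.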

2.1 Estimation through Entropy Measurements 

The stochastic translation schema introduced in the previous section is a
statistical model which can describe probability distributions over the universe of
all possible pairs of input-output strings, $\Sigma^* \times \Delta^*$. The shape of
these distributions depends both on the structure of each particular schema and on
the set of probability parameters associated with the set of rules. Therefore, the
process of building one of these schemata as a model of a certain probability
distribution may be performed in two separate phases. First, a non-stochastic schema is
generated, and second, a set of probability values for the rules in the schema is
chosen. The generation of the structure can be done either manually or
automatically (there are techniques for inferring RTs from examples; see [?]). Once a
set of rules is given, the set of parameters will often need to be estimated from a
representative sample² of translation pairs. In this way we can obtain a stochastic
schema that approximates the empirical probability distribution.

² A sample is any finite collection of translations with repetitions allowed (a multiset
drawn from $\Sigma^* \times \Delta^*$).

The problem of estimating the parameters of a model from a finite set of
data has been thoroughly studied in Statistics. A well-known, general-purpose
method for this is the so-called Expectation-Maximization method [?]. Here we
will use the Baum-Welch algorithm, a more specific version of the
Expectation-Maximization method which is suitable for estimating the probabilities of
rules in an SRT. Thus, our process of estimation of the parameters of an SRT will be
as follows. First of all, we define a function that depends both on the statistical
parameters that we want to estimate and on the pairs of sentences in the training
data. This function is designed to be sensitive to the relevant information in the
sample, in such a way that higher values of the function correspond to better
approximations of the model to reality. We will then use the Baum-Welch algorithm,
or some variation of it, to find the optimal value of this function.

Maximum Likelihood Estimation (MLE) was proposed in [?] for estimating
the statistical parameters of an SRT. It is based on the following assumption: it is
supposed that the sample has been generated by a model that describes perfectly
the real probability distribution. Under this assumption, the maximization of the
likelihood of the training sample tends to make the distribution converge to the
real one for increasingly large samples. The likelihood function to be maximized
is:

\[ R_{MLE}(\Phi(T)) = \prod_{(x,y) \in TS} Pr(x, y \mid \Phi(T)), \tag{8} \]

where $\Phi(T)$ is the set of parameters of SRT $T$ and $TS$ is a training sample.

In real applications, however, the amount of available data is far from being
“large” in the theoretical sense, and MLE turns out to be a very poor method for
estimation. The method that we present below is designed
to make better use of sparse data than MLE, and it is based on concepts such
as conditional entropy and information channels.

The entropy $H(X)$ is a measure of the number of bits that are needed to
specify the outcome of a random event $X$ [?]. Intuitively, entropy can be
understood as a plausible measure of the level of uncertainty in the event. It is
defined as follows:

\[ H(X) = -\sum_{x} Pr(x) \log Pr(x). \]

Similarly, the conditional entropy of the random event $X$ given the random event
$Y$ is a measure of the uncertainty in $X$ given the outcome of $Y$:

\[ H(X \mid Y) = -\sum_{x,y} Pr(x, y) \log Pr(x \mid y). \]
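The two definitions above can be checked numerically on small distributions. The
following is a minimal sketch; the dictionary encoding of the distributions and the
function names are assumptions made for illustration, and the logarithm is taken in
base 2 so that the result is expressed in bits.

```python
from math import log2


def entropy(px):
    """H(X) in bits for a distribution given as {x: probability}."""
    return -sum(p * log2(p) for p in px.values() if p > 0)


def conditional_entropy(pxy):
    """H(X|Y) in bits for a joint distribution given as {(x, y): probability}."""
    # Marginal Pr(y), needed for Pr(x|y) = Pr(x, y) / Pr(y).
    py = {}
    for (_x, y), p in pxy.items():
        py[y] = py.get(y, 0.0) + p
    return -sum(p * log2(p / py[y]) for (_x, y), p in pxy.items() if p > 0)
```

For a fair binary source `entropy({0: 0.5, 1: 0.5})` is 1 bit. When `X` and `Y` are
independent fair bits, knowing `Y` does not help and `H(X|Y)` is also 1 bit; when `X`
is fully determined by `Y`, `H(X|Y)` is 0.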

A statistical translation system can be interpreted as a bidirectional channel,
where two sources or random events produce sentences in each of the languages
involved, respectively, following the real probability distribution of sentences in
either language. The translation channel performs a probabilistic mapping: it
sets the probability $Pr(x, y)$ for each pair $(x, y)$ of sentences where $x$ belongs
to one of the languages and $y$ to the other.

If we assume that a real translation between two languages can be properly
described as one of these statistical translation systems, then our goal is to obtain
a statistical model $m$ which is as close as possible to the theoretical real system.
Let $H_m(X \mid Y)$ stand for the conditional entropy of $X$ given $Y$ with respect to
the probability distributions in model $m$. It has been demonstrated in [?] that
the inequality $H_m(X \mid Y) \geq H(X \mid Y)$ always holds. Furthermore, the smaller
the value of $H_m(X \mid Y)$, the more the distribution in model $m$ resembles the real
distribution. $H_m(X \mid Y)$ and $H(X \mid Y)$ are equal when the two distributions are
the same. Therefore, we want to choose some model $m$ that minimizes $H_m(Y \mid X)$.
On the other hand, note that channel models are symmetrical with respect to
the direction of translation. Hence, we might also want the model to minimize
$H_m(X \mid Y)$. These two simultaneous goals can be achieved by looking for a model
that minimizes the sum of both values, $H_m(X \mid Y) + H_m(Y \mid X)$.



Maximum Conditional Entropy Estimation. The criterion just mentioned
can be used to estimate the parameter set of an SRT. First, notice that:

\[ H_m(X \mid Y) + H_m(Y \mid X) = -\sum_{x,y} Pr(x, y) \log \frac{Pr_m^2(x, y)}{Pr_m(x)\, Pr_m(y)}. \]
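The intermediate steps behind this identity can be written out as follows. The
derivation assumes that $H_m(X \mid Y)$ is computed with the real distribution
$Pr(x, y)$ as weight and the model distribution $Pr_m$ inside the logarithm, which is
the reading suggested by the definitions above.

\[
\begin{aligned}
H_m(X \mid Y) + H_m(Y \mid X)
 &= -\sum_{x,y} Pr(x, y) \left[ \log Pr_m(x \mid y) + \log Pr_m(y \mid x) \right] \\
 &= -\sum_{x,y} Pr(x, y) \log \left( \frac{Pr_m(x, y)}{Pr_m(y)} \cdot \frac{Pr_m(x, y)}{Pr_m(x)} \right) \\
 &= -\sum_{x,y} Pr(x, y) \log \frac{Pr_m^2(x, y)}{Pr_m(x)\, Pr_m(y)}.
\end{aligned}
\]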



For a given RT $T$ together with a parameter set $\Phi(T)$, $Pr_m(x, y)$ is equal to
$Pr(x, y \mid \Phi(T))$. Since we do not know the real probability distribution,
$Pr(x, y)$, we must instead assume that the pairs $(x, y)$ in our sample $TS$ are
representative and choose $\Phi(T)$ to minimize

\[ -\sum_{(x,y) \in TS} \log \frac{Pr^2(x, y \mid \Phi(T))}{Pr(x \mid \Phi(T))\, Pr(y \mid \Phi(T))}. \tag{9} \]



Hence, Maximum Conditional Entropy Estimation (MCEE) is the estimation
algorithm that consists of maximizing the following function:

\[ R_{MCEE}(\Phi(T)) = \prod_{(x,y) \in TS} \frac{Pr^2(x, y \mid \Phi(T))}{Pr(x \mid \Phi(T))\, Pr(y \mid \Phi(T))}. \tag{10} \]
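To make the difference between the two criteria concrete, the sketch below evaluates
the logarithms of (8) and (10) for a model given directly as a joint probability
table. This is an illustrative assumption: in the paper the probabilities come from an
SRT, whereas here the tables `AMBIGUOUS` and `UNAMBIGUOUS`, the sample and all function
names are hypothetical.

```python
from math import log


def marginals(joint):
    """Pr(x) and Pr(y) obtained from a joint table {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def log_r_mle(joint, sample):
    """log of (8): sum over the sample of log Pr(x, y)."""
    return sum(log(joint[pair]) for pair in sample)


def log_r_mcee(joint, sample):
    """log of (10): sum of 2 log Pr(x, y) - log Pr(x) - log Pr(y)."""
    px, py = marginals(joint)
    return sum(2 * log(joint[(x, y)]) - log(px[x]) - log(py[y])
               for x, y in sample)


# Two hypothetical models over the same toy sentences.
AMBIGUOUS = {("a", "A"): 0.4, ("a", "B"): 0.1, ("b", "B"): 0.5}
UNAMBIGUOUS = {("a", "A"): 0.5, ("b", "B"): 0.5}
SAMPLE = [("a", "A"), ("b", "B")]
```

Because $Pr(x, y) \le Pr(x)$ and $Pr(x, y) \le Pr(y)$, every factor of (10) is at most
1, so `log_r_mcee` is never positive. It reaches 0 for `UNAMBIGUOUS`, where each
sentence of a sample pair determines the other, and is negative for `AMBIGUOUS`.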



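The reestimation formulae for this function will be obtained through the
application of an extension of the Baum theorem to rational functions due to
Gopalakrishnan et al. [?]. Let $Q^{MCEE}$ be a transformation from the space
$\Phi(T)$ into itself. Then, $\forall (A \to aB, zB) \in R$ we have:

\[
Q^{MCEE}(P(A \to aB, zB)) =
\frac{P(A \to aB, zB) \left( \dfrac{\partial \log R_{MCEE}(\Phi(T))}{\partial P(A \to aB, zB)} + C \right)}
     {\displaystyle\sum_{a', z', B'} P(A \to a'B', z'B') \left( \dfrac{\partial \log R_{MCEE}(\Phi(T))}{\partial P(A \to a'B', z'B')} + C \right)}
\tag{11}
\]

where the numerator can be expanded using

\[
\begin{aligned}
P(A \to aB, zB)\, \frac{\partial \log R_{MCEE}(\Phi(T))}{\partial P(A \to aB, zB)}
 = \sum_{(x,y) \in TS} \Bigg(
 & \frac{2}{Pr(x, y \mid \Phi(T))}\, P(A \to aB, zB)\, \frac{\partial Pr(x, y \mid \Phi(T))}{\partial P(A \to aB, zB)} \\
 & - \frac{1}{Pr(x \mid \Phi(T))}\, P(A \to aB, zB)\, \frac{\partial Pr(x \mid \Phi(T))}{\partial P(A \to aB, zB)} \\
 & - \frac{1}{Pr(y \mid \Phi(T))}\, P(A \to aB, zB)\, \frac{\partial Pr(y \mid \Phi(T))}{\partial P(A \to aB, zB)}
 \Bigg)
\end{aligned}
\tag{12}
\]

and $C$ is an admissible constant [?].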

The first term in the sum is proportional to the expected number of times
that the rule is used in the training set in the Maximum Likelihood reestimation
approach and can be computed as in [?]. The second term is proportional to
the expected number of times that the rule of the input grammar of $T$ is used
for parsing the set of inputs of the training translations. This input grammar
is $G_i = (N, \Sigma, R_i, S, P_i)$, where, if $(A \to aB, zB) \in R$, then
$(A \to aB) \in R_i$ and $P_i(A \to aB) = P(A \to aB, zB)$. The formulae obtained for
MLE with Stochastic Grammars can be used to compute this term [7]. Similarly, the third
term is proportional to the expected number of times that the rule of the output
grammar of $T$ is used for parsing the set of outputs of the training translations.
This grammar is $G_o = (N, \Delta, R_o, S, P_o)$, where, if $(A \to aB, zB) \in R$, then
$(A \to zB) \in R_o$ and $P_o(A \to zB) = P(A \to aB, zB)$. As for the second term, a
simple modification of the formulae in [7] can be used to compute the third term.



Discussion about Maximum Mutual Information Estimation. A different method
based on entropy measures was proposed in [?]. It was a straightforward
application of the Maximum Mutual Information Estimation (MMIE) by Brown [?]
to stochastic translation schemata. In Brown’s MMIE it is claimed that
minimizing the conditional entropy $H(X \mid Y)$ is equivalent to maximizing the mutual
information, $I(X; Y)$, since

\[ H(X \mid Y) = H(X) - I(X; Y). \]

$H(X)$ represents the entropy of the source $X$ and is supposed to be determined
by some known language model and, therefore, fixed. However, this
approximation is not adequate if (as in our case) the language model that is being used
is not independent of the translation model. When dealing with SRTs, the
probabilities of sources $X$ and $Y$ given by the model,
$Pr_m(x) = Pr(x \mid \Phi(T))$ and $Pr_m(y) = Pr(y \mid \Phi(T))$, are a function of the
set of parameters of the SRT, $\Phi(T)$. Therefore, they are not fixed during the
estimation process and MMIE cannot be applied as proposed in [?].

MMIE could be used if an independent model for the probabilities of the
input and the output sentences were given. Such a model could be, for instance,
a probabilistic model that represented information about the context of
appearance of sentences within a line of discourse.

3 Experiments 

Some experiments were carried out to compare MCEE with MLE. The selected
task was the translation of Spanish sentences into English, as defined in project
EuTrans-I [?]. The semantic domain of the sentences is restricted to tourist
information, consisting of sentences that a hotel guest would address to a hotel
receptionist at the information desk. A parallel corpus of paired Spanish-English
sentences was artificially generated.

The structure of an SRT was inferred from the corpus by means of a new
method for building finite-state transducers using regular grammars and
morphisms [?]. The inferred SRT contained 490 non-terminal symbols and 1438
rules.

Training was done with 5 different series of training sets. Each series was
composed of 10 nested sets of increasing size (each set included in the next one),
containing 25, 50, 75, 100, 125, 150, 175, 200, 250 and 300 pairs, respectively. A set
containing 500 different translation pairs was used for testing. The test set is
disjoint from all training sets. All results were averaged over the 5 series of
experiments.

The test set perplexity for these experiments is shown on the left of Figure 1
and the word error rate (WER) is shown on the right. Both measures turned out
to be significantly better for MCEE for the smaller training sets, while MLE gets
better results when the training sets are larger.




Figure 1. Test set perplexity and word error rate vs. size of the training set, respec- 
tively. 



4 Conclusions 

A new method for estimating the probabilistic parameters of an SRT with sparse
training data has been presented in this paper. The method is based on the MMI
criterion, although it is a different one, since direct application of MMI to SRTs
is not adequate. Experiments on real data have been reported. MCEE exhibited
better performance both in perplexity and word error rate for small training
samples, while MLE was better when the available amount of data was greater.
This seems to indicate that MCEE is a good estimation criterion for the stochastic
parameters of SRTs and may be especially useful when training data is scarce.
Abstract. In this paper, a procedure which computes error-correcting
subgraph isomorphisms is proposed that is able to take some external
information into account. When matching a model graph and a
data graph, if the correspondence between vertices of the model graph
and some vertices of the data graph is known a priori, the procedure
is able to integrate this knowledge in an efficient way.

The efficiency of the method is obtained in the first step of the
procedure, namely, the recursive decomposition of the model graph into
subgraphs. During this step, the external information is propagated
as far as possible thanks to a new procedure which makes the graphs
able to share it.

Since the data structure is now able to fully integrate the external
information, the matching step itself becomes more efficient.

The theoretical aspects of this methodology are presented, as well as
practical experiments on real images. The procedure is tested in the field
of 3-D building reconstruction for cartographic purposes, where it makes it
possible to match model graphs partially and then perform full matches.

Keywords : Graph matching, error-tolerance, external information, 3-D
building reconstruction, cartography, stereoscopy.



1 Introduction 

This research work¹ is part of a project aiming at automating the process of
3-D reconstruction of buildings from high-resolution stereo-pairs. The goal is to
reconstruct the structure of the roofs. The strategy is model driven. For now, the
model corresponding to a given building is assumed to be known (some models are
shown in Fig. [?]).

The models are attributed relational graphs, each vertex standing for a 3-D
feature (a 3-D line segment, a 3-D planar region, or a facade of a building),
each edge standing for a geometric property (such as parallelism,
orthogonality, ...). The building reconstruction is based on the computation of a
subgraph isomorphism between a model $G$ and a graph $G_I$ built on a set of 3-D
features derived from the images. Since under-detection and geometrical errors are
hard to avoid, the subgraph isomorphism computation has to be error-tolerant.
Among the existing graph matching techniques ([?]), the error-correcting
subgraph isomorphism detection (ECD) presented in [?] is well suited to our
case. More precisely, the shapes of the buildings usually have common subparts,
and the ECD is able to exploit the common subparts of the model graphs
in order to reduce the combinatorial complexity of the problem. This is why the
ECD is used in this work to match the model and the data. The ECD uses the
concept of edit distance functions, and proposes an efficient method to find the
subgraph isomorphism which minimizes an edit distance. This work extends the
ECD, so some major points of the ECD will first be recalled in Sect. 2.
However, for more details about the ECD and the edit distances, see [?].

¹ This work is supported by the French national cartographic institute, IGN.

In this paper, the key point is a modification of the ECD in order to
exploit external information (e.g. a user input or pre-computed
information). More precisely, if the correspondence between some vertices of the model
and some vertices of the data is already known before the matching, the search
space of the matching problem can be pruned by integrating the external
information in the main data structure of the ECD, which is called the decomposition.
This integration has to be done carefully, as shown in Sect. 3.

Experimental results about the use of partial matches as external information
for the matching procedure will be described in Sect. [?].



2 Notations 

This section mainly recalls some notations used in the ECD ([?]).

- $G = (V, E, \mu, \nu)$ or $G = (V, E)$ denotes a graph, $v \in V$ is a vertex,
  $e \in E$ is an edge, $\mu : V \to L_V$ is the vertex labelling function, and
  $\nu : E \to L_E$ is the edge labelling function.

- A subgraph of $G$ based upon a subset of vertices $V_S \subset V$ is denoted
  $G_{V_S} = (V_S, E_S, \mu_S, \nu_S)$, and the difference of two graphs is
  $G - G_{V_S} = G_{V \setminus V_S}$.

- The union of two graphs $G' = G_1 \cup_E G_2$ according to a set of edges $E$
  and an edge labelling function $\nu : E \to L_E$ is defined by:
  $V' = V_1 \cup V_2$; $E' = E_1 \cup E_2 \cup E$; $\mu'(v) = \mu_1(v)$ if
  $v \in V_1$ or $\mu'(v) = \mu_2(v)$ if $v \in V_2$; $\nu'(e) = \nu_1(e)$ if
  $e \in E_1$, $\nu_2(e)$ if $e \in E_2$, $\nu(e)$ if $e \in E$.

- A subgraph isomorphism from $G$ to $G'$ is a function $f : V \to V'$ such that
  $f$ is an isomorphism from $G$ to a subgraph of $G'$.

- A decomposition $D$ is a set of nodes, each node being a 4-tuple
  $(G, G', G'', E)$, also denoted $(G, G', G'')$, such that:

  * $G' \subset G$, $G'' \subset G$, and $G = G' \cup_E G''$: $G$ can be split
    into its two sons.

  * If $(G, G', G'', E) \in D$, there is no other $(H, H', H'', F) \in D$ such
    that $G = H$: $G$ appears only once in the decomposition.

  * If $G'$ (resp. $G''$) has more than one vertex, then there is a node
    $(H, H', H'', F)$ in $D$ such that $H = G'$ (resp. $G''$).

  * If $G'$ (resp. $G''$) has only one vertex, then there is no node
    $(H, H', H'', F)$ in $D$ such that $H = G'$ (resp. $G''$).

  For a node $(G, G', G'')$, $G$ will be called the main graph.

  Such a decomposition is a network. Each main graph of a decomposition can
  be recursively split into subgraphs, in a unique way. Figure 1 is an example.
  In this figure, each rounded rectangle represents the main graph $G$ of a
  node $(G, G', G'')$ ($E$ is not represented). Each rectangle is pointed to by
  two arrows which come from the two sons $G'$ and $G''$ of the node. The figure
  uses 3 labels (a, b, c), and each vertex is identified by a number. The nodes
  themselves are identified by a number next to their bottom-right corner.
  One can remark that graph 6 is made of two subgraphs (1,2,3) and
  (4,5,6) which are isomorphic, so that both sons of this node are node 5,
  which is stored only once. In [?], Messmer presents a method to insert a
  model graph into the decomposition. This method is recalled in Fig. 1.
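The subgraph, difference and union operations from the notation list above can be
sketched with plain dictionaries. The encoding of a graph as a pair
`(vertex_labels, edge_labels)`, the assumption that a subgraph keeps every edge whose
two endpoints are kept, and the function names are illustrative choices that do not
come from the paper.

```python
def subgraph(g, keep):
    """G_{V_S}: restriction of g to the vertices in keep (induced edges assumed)."""
    vertices, edges = g
    keep = set(keep)
    return (
        {v: lab for v, lab in vertices.items() if v in keep},
        {(u, w): lab for (u, w), lab in edges.items() if u in keep and w in keep},
    )


def difference(g, removed):
    """G - G_{V_S} = G_{V \\ V_S}."""
    return subgraph(g, set(g[0]) - set(removed))


def union(g1, g2, linking_edges):
    """G1 U_E G2: both graphs plus the labelled edges E that link them."""
    (v1, e1), (v2, e2) = g1, g2
    return {**v1, **v2}, {**e1, **e2, **linking_edges}


# Hypothetical model graph: vertex labels a, b, c and two labelled edges.
G = ({1: "a", 2: "b", 3: "c"}, {(1, 2): "parallel", (2, 3): "orthogonal"})
```

Splitting `G` into `subgraph(G, {1, 2})` and `difference(G, {1, 2})` and joining the
two parts again with the linking edge `{(2, 3): "orthogonal"}` gives back `G`, which is
the relation $G = G' \cup_E G''$ required of every node of a decomposition.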

Decomposing $G$ into a decomposition $D$:

1. In $D$, compute the largest graph $S_{max}$ such that $S_{max}$ is a subgraph of
   $G$, and such that there is a node $(S_{max}, G', G'', E)$ in $D$.

2. If $S_{max}$ and $G$ are isomorphic ($G$ has already been decomposed), then exit.

3. If no $S_{max}$ was found, then choose $S_{max}$ as a random subgraph of $G$, and
   decompose $S_{max}$.

4. Decompose $G - S_{max}$.

5. Add the node $(G, S_{max}, G - S_{max}, E)$ where $E$ is such that
   $G = S_{max} \cup_E (G - S_{max})$.




Fig. 1. Left : a decomposition. Right : the standard decomposition algorithm. 

If the correspondence between some model vertices and the data is well established
before the matching step, the method is not able to take this knowledge into
account. In the next section, a modification of the method that uses this
knowledge is proposed. More precisely, we will add some labels to the model
vertices, modify some graph-theoretic definitions, and then propose a
new decomposition procedure which propagates this knowledge throughout the
decomposition.



3 Introducing External Information in the Decomposition 

Here, the aim is to integrate some a priori information about the matching.
Namely, we consider that for some vertices of a model graph $G$, the matching with
the vertices of the input graph is already known: for example, vertex 4 (resp.
5) of graph 6 of Fig. 1 is known to correspond to vertex 17 (resp. 23) of the
input graph. We will first present a simple approach to take this into account in
the decomposition process. Then, a decomposition method leading to efficient
matching will be proposed.

3.1 First Approach 

The first approach is to add some labels to the vertices of the model graphs in
order to use the external information. Formally, the graphs with a priori matches
(AP-graphs) are defined as follows:

- An AP-graph is a graph $G^{ap} = (V, E, \mu, \nu)$ whose vertex labels are taken in
  $L_V \times (\{\emptyset, \$\} \cup V_I)$, where $V_I$ is the set of vertices of the
  input graph.

  For a vertex $v \in V$, we have $\mu(v) = (l, l_a)$ where:

  * $l_a = \emptyset$ means that there is no a priori match for $v$. This vertex will
    be called a strict vertex.

  * $l_a = \$$ means that the vertex is not matched with any vertex of the input
    data. This may happen if a partial match of graph $G$ has already shown
    that $v$ has no correspondence in the data $G_I$.

  * $l_a = v_i$ (with $v_i \in V_I$) means that $v$ is matched with vertex $v_i$ of
    $G_I$.

  With this definition, an AP-graph can also be written as
  $G^{ap} = (V^s \cup V^{ap}, E)$, where $V^s \subset L_V \times \{\emptyset\}$ and
  $V^{ap} \subset L_V \times (\{\$\} \cup V_I)$. $V^{ap}$ (resp. $V^s$) is the set
  of the vertices of $G^{ap}$ which are (resp. which are not) a priori matched.

- The definition of an AP-graph isomorphism between
  $G^{ap} = (V = V^s \cup V^{ap}, E)$ and $G'^{ap} = (V' = V'^s \cup V'^{ap}, E')$ is
  the natural extension of the isomorphism of two graphs. However, the equality
  between two such labels has to be defined: two labels $(l, l_a)$ and $(l', l'_a)$
  are equal if and only if $l = l'$ and $l_a = l'_a$. This means that two AP-vertices
  are isomorphic if their labels are the same and if they share the same external
  information (i.e. if they are matched with the same input vertex).

  Thus, when there is an isomorphism $f$ between $G^{ap}$ and $G'^{ap}$, we have
  $f(V^s) = V'^s$ and $f(V^{ap}) = V'^{ap}$.
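The effect of the extended label equality can be illustrated with a brute-force
isomorphism test for very small AP-graphs. This is only a sketch under simplifying
assumptions: edges are undirected and unlabelled, a vertex label is a pair
`(l, l_a)` with `l_a = None` standing for "no a priori match", and the function name
`ap_isomorphic` is hypothetical. Enumerating all permutations is exponential and is
not how the ECD proceeds.

```python
from itertools import permutations


def ap_isomorphic(g1, g2):
    """Brute-force AP-graph isomorphism for tiny graphs.

    A graph is (labels, edges): labels maps a vertex to (l, l_a) and edges is
    a set of vertex pairs. Two labels are equal only if both l and l_a agree.
    """
    (v1, e1), (v2, e2) = g1, g2
    if len(v1) != len(v2) or len(e1) != len(e2):
        return False
    names1 = list(v1)
    target_edges = {frozenset(e) for e in e2}
    for perm in permutations(v2):
        f = dict(zip(names1, perm))
        labels_ok = all(v1[u] == v2[f[u]] for u in names1)
        edges_ok = {frozenset((f[a], f[b])) for a, b in e1} == target_edges
        if labels_ok and edges_ok:
            return True
    return False
```

Two graphs that differ only in the input vertex attached to one AP-vertex (for
example `("b", 17)` against `("b", 23)`) are not isomorphic under this definition,
although they would be if the external information were ignored.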

In our example, according to this definition, the problem is to match $G^{ap}$
with the data, where the label of vertex 4 is $(B, v_{17})$, and the label of vertex 5
is $(C, v_{23})$. This graph is shown in node 6 of Fig. 2.

According to these definitions, the decomposition procedure proposed in [?]
(Fig. 1) still works and takes the a priori matches into account. More precisely,
when looking for the largest subgraph of the model (step 1 of the algorithm of
Fig. 1), graph 5 is found. The difference between graph 6 and graph 5 is
then graph 7, which has to be split recursively. The largest subgraph of graph 7
is graph 1, so that we now have to split graph 8 into 2 parts. Since graph 8 is based
on 2 vertices which are both a priori matched, the method stops just after having
inserted graphs 9 and 10 in the decomposition.

One can see that the edges between vertices 2 and 4 (and between vertices 3
and 5) are forgotten very early in the decomposition process. But it is clear that
when finding matches for graph 5 (which is the subgraph (1,2,3) of graph 6), it
would be useful to take advantage of these edges, which hold much information:
namely, vertex 2 of graph 5 can only be matched with a vertex which is
a neighbor of vertex 17 of the input. This is the key point leading to the
pruning of the search space. However, in order to make the matching process
of the ECD able to use this information, the information has to be inserted in
the whole decomposition. The next section shows that it is possible to use this
information all through the decomposition process, which leads to an efficient
matching procedure.




Fig. 2. Example of a decomposition using the AP-subgraph isomorphism 



3.2 Efficient Knowledge Usage 

The union of two AP-graphs has to be redefined:

- The union of two AP-graphs $G_1^{ap} = (V_1 = V_1^{s} \cup V_1^{ap}, E_1)$ and
$G_2^{ap} = (V_2 = V_2^{s} \cup V_2^{ap}, E_2)$, according to a set of edges
$E \subseteq V_1 \times V_2 \cup V_2 \times V_1$, is the AP-graph $G^{ap} = (V, E')$
such that $V = V_1^{s} \cup V_2^{s} \cup V_1^{ap} \cup V_2^{ap}$ and $E' = E_1 \cup E_2 \cup E$.
The only difference here is that we do not assume that the intersection between
$V_1^{ap}$ and $V_2^{ap}$ is empty. More precisely, if two graphs share some a priori
vertices, then these vertices are merged together.

This definition is important here : since the union of AP-graphs is defined 
for two graphs which share some vertices, the decomposition process will be 
able to propagate some shared information between both sons of a node of 
the decomposition, as shown in the next definition. 

- For a graph $G^{ap} = (V^{s} \cup V^{ap}, E)$ and a subgraph $G_s^{ap} = (V_s, E_s)$ of $G^{ap}$, we
define the extension $\mathcal{E}(G_s^{ap})$ of $G_s^{ap}$ in $G^{ap}$ as the graph obtained by
adding to $G_s^{ap}$ the vertices of $\mathcal{E}(V_s)$, where

$\mathcal{E}(V_s) = \{ v \in V^{ap} \mid \exists v' \in V_s^{s} \text{ such that } v \text{ and } v' \text{ are neighbors} \}$.

$\mathcal{E}(V_s)$ is the set of AP-vertices of $G^{ap}$ that are neighbors of at least one
strict vertex of the subgraph $G_s^{ap}$. Thus, $\mathcal{E}(G_s^{ap})$ is the graph $G_s^{ap}$ itself,
plus the vertices that are in its neighborhood and that hold some external
information. As an example, graph 9 of Fig. 3 is the extension of graph
5 in graph 6, since the two vertices 4 and 5 are neighbors of nodes 2 and
3 respectively. One can notice that $G_s^{ap}$ is a subgraph of $\mathcal{E}(G_s^{ap})$, and that
$\mathcal{E}(G_s^{ap})$ is a subgraph of $G^{ap}$.



3.3 The Decomposition Algorithm

The new decomposition algorithm propagates the a priori knowledge through
the decomposition. There is only one major change in the graph decomposition.
Namely, when the largest subgraph $S_{max}$ of a graph $G^{ap}$ to be decomposed is
found, we try to see which external information could be added to $S_{max}$. This
is done by extending $S_{max}$ in $G^{ap}$. However, $\mathcal{E}(S_{max})$ may not appear in the
decomposition. This is why it has to be added first. Moreover, there is a second
subtlety: $\mathcal{E}(S_{max})$ may be isomorphic to $G^{ap}$ itself; in that case, $G^{ap}$ is made of
$S_{max}$ plus some a priori vertices. This means that the decomposition of the strict
part of $G^{ap}$ has already been done once, but without any a priori information. In
that case, we just redo its decomposition, but with external information. This
means: replace $S_{max}$ by one of its sons $S'_{max}$.

Once the first son of $G^{ap}$ is found, the second one is its complement
$G^{ap} - \mathcal{E}(S_{max})$. Again, we extend this graph in order to complete it with some
external information. That way, the a priori knowledge is propagated through
the decomposition. It leads to the following algorithm:

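1. In $D$, compute the largest graph $S_{max}$ such that $S_{max}$ is a subgraph of $G^{ap}$,
and such that there is a node $(S_{max}, G', G'', E)$ in $D$.

2. Let $S_{max}^{ap}$ be the extension of $S_{max}$ in $G^{ap}$.

3. If $S_{max}^{ap}$ is isomorphic to $G^{ap}$ then let $S_{max}$ be $G'$ where $(S_{max}, G', G'', E)$
is in $D$, otherwise, add $S_{max}^{ap}$ into $D$.

4. Let $\bar{S}_{max}^{ap} = \mathcal{E}(G^{ap} - S_{max}^{ap})$.

5. Add $\bar{S}_{max}^{ap}$ into the decomposition.

6. Add $(G^{ap}, S_{max}^{ap}, \bar{S}_{max}^{ap}, E)$ to $D$.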

In the case of graph 6 of Fig. 3, the largest subgraph is graph 5. Now, graph
5 is extended in graph 6. The result of this extension is graph 9, which is put in the
decomposition. The best subgraph of graph 9 is again graph 5 (because graph 9
is the extension of graph 5 that we have just computed), so that graph 4 (a son of
graph 5) becomes the best candidate. Its extension is graph 10, which then has
to be put into $D$. The whole run of the method leads to Fig. 3, which is more
complex, but which fully integrates the external information.
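
The extension operation used in steps 2 and 4 can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the `APGraph` class, its field names and the edges among the strict vertices 1, 2, 3 are assumptions made for the example; only the AP-vertices 4 and 5, their input matches (17 and 23) and the edges 2-4 and 3-5 come from the text. The sketch also assumes that the extension keeps every edge of the parent graph whose two endpoints are retained.

```python
from dataclasses import dataclass


@dataclass
class APGraph:
    """Hypothetical container for an AP-graph (names are illustrative)."""
    strict: set   # strict vertices, still to be matched
    ap: dict      # a priori vertices: vertex -> input vertex already matched
    edges: set    # undirected edges, each stored as frozenset({u, v})

    def neighbors(self, v):
        """Vertices sharing an edge with v."""
        return {w for e in self.edges if v in e for w in e if w != v}


def extension(g, sub_vertices):
    """Extend the subgraph of g spanned by sub_vertices with the AP-vertices
    of g that neighbor at least one of its strict vertices."""
    sub_vertices = set(sub_vertices)
    sub_strict = sub_vertices & g.strict
    added_ap = {v for v in g.ap if g.neighbors(v) & sub_strict}
    kept = sub_vertices | added_ap
    # Assumption: keep every edge of g whose two endpoints are retained.
    edges = {e for e in g.edges if e <= kept}
    return APGraph(
        strict=kept & g.strict,
        ap={v: g.ap[v] for v in kept if v in g.ap},
        edges=edges,
    )


def edge(u, v):
    """Build an undirected edge."""
    return frozenset((u, v))


# Graph 6 of the example: strict vertices 1, 2, 3; AP-vertices 4 and 5.
# The edges 1-2 and 1-3 are invented for the illustration.
graph6 = APGraph(
    strict={1, 2, 3},
    ap={4: 17, 5: 23},
    edges={edge(1, 2), edge(1, 3), edge(2, 4), edge(3, 5)},
)

ext = extension(graph6, {1, 2})
print(sorted(ext.strict), ext.ap)  # AP-vertex 4 is pulled in, vertex 5 is not
```

Extending the full strict part `{1, 2, 3}` pulls in both AP-vertices, which corresponds to the situation where the extension coincides with the graph being decomposed.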



3.4 Analysis of the Algorithm Complexity 

This section analyses the complexity of the matching process which takes place
after the decomposition itself (the matching process is not described here,
because it is a direct extension of the standard error-correcting subgraph
isomorphism computation method). The worst case complexity of the error-correcting
subgraph isomorphism detection is $O(L \, m^n n^2)$ where $L$ is the number of model
graphs, $n$ is the number of vertices of the model graphs, and $m$ is the number
of vertices of the input graph (i.e. the number of 3-D features extracted from the
images). In our case, since we know the match for $|V^{ap}|$ vertices, we only have
to match $q = n - |V^{ap}|$ vertices for a graph, so that the worst case complexity
of the first approach is $O(L \, m^q q^2)$. In the second approach, in the worst case,
we also may have to compute all the matchings from the models to the data.
In particular the neighborhood between the strict vertices and the AP-vertices
may not be useful because the edit distance functions consider edge deletion as
a possible edit operation: in the case of the deletion of a strict-AP edge there is
no gain because the information held by the AP-vertices is lost.




Fig. 3. Example of an efficient decomposition 



However, if we do not use edge deletion as a possible edit operation (which is
the case in our application), then let us consider a particular strict vertex $v_i$. If $v_i$ is not
a neighbor of an AP-vertex, then it could be matched with any input vertex. But if
it is a neighbor of a set of AP-vertices, then the possible candidates for a match
with $v_i$ are the input vertices which are neighbors of the corresponding input
vertices. Let us assume that $m_i$ is the size of this set of vertices; then the worst
case complexity of the efficient approach is $O(L \, m_1 \cdots m_q \, q^2)$. Depending on the
values of $m_i$, this drastically reduces the combinatorics.
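
To give a feeling for the gain, the short Python sketch below compares the number of vertex assignments explored in the two situations: every strict vertex free to match any of the m input vertices, versus each strict vertex restricted to its own candidate set. The function names and the candidate set sizes are hypothetical; only m = 69 (40 + 16 + 13 input vertices) is taken from Section 4. The common factors L and q^2 are left out since they are identical in both bounds.

```python
from math import prod


def naive_candidates(m, q):
    """Assignments explored when each of the q strict vertices may be
    matched with any of the m input vertices: m ** q."""
    return m ** q


def pruned_candidates(sizes):
    """Assignments explored when strict vertex i is restricted to a
    candidate set of size sizes[i]: m_1 * ... * m_q."""
    return prod(sizes)


m = 69              # 40 + 16 + 13 input vertices (Section 4)
sizes = [69, 4, 3]  # hypothetical m_i: only the first vertex is unconstrained

print(naive_candidates(m, len(sizes)))  # 69 ** 3
print(pruned_candidates(sizes))         # 69 * 4 * 3
```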



4 Application 

The results of this study have been applied in the field of building reconstruction, 
which is of great importance (see m, iHi for some particular automatic or 
semi-automatic approaches, and u, m for some collections of articles). 

We are working on high-resolution aerial grey-level stereo pairs (resolution:
8 cm/pixel). The building areas are extracted with an automatic procedure
which focuses on the raised objects (P). These focusing areas are derived from
a Digital Elevation Model (DEM, a regular grid representing the topographic
surface); see Fig. 4.

After this focusing step, 3-D features are extracted from the images and the
DEM. We work with 3 types of features: 3-D line segments, 3-D planar regions
(see 0), and 3-D facades. The procedures which compute these features will
not be described here. As an example, Fig. 4 (on the right) shows the 3-D line
segments.

After the detection step, the input graph $G_I$ is built: an edge $(v_1, v_2)$ is
added if $v_1$ and $v_2$ share a geometric property (such as parallelism or intersection).




Fig. 4. Initial data : Left Image, Right Image, DEM, and 3-D line segments 



The models describe the features we are looking for: their edges hold the
geometric properties typical of 3-D objects. The matching process is then
initiated. It identifies the 3-D features which contribute to the final reconstruction of
the object. This procedure has already been described in 0.

Here, we will use 3 different models (see Fig. 5). In this figure, the numbers
under each model summarize the size of the model: they correspond to the
number of vertices for each type of feature: linear, planar, and facade. The
fourth number is the total number of vertices.

The first model (A) is a partial gabled roof (one side is missing): this model
has 2 surfaces, 5 line segments (1 ridge, 2 gutters, 2 gabled gutters), and 3 facades.
B is an L-shaped building. C is an L-shaped building with a bevelled corner. It has
26 vertices, which leads to a highly combinatorial problem.

The input graph has 40 linear vertices, 16 facades, and 13 surfaces.

To avoid combinatorial problems for complex models such as C, we
experimented with the method and used partial matches. These experiments are
summarized in Table 1. In this table the columns give the model used, the amount
of external information, the CPU time needed with the initial approach (CPU 1),
and the CPU time needed with the efficient approach (CPU 2); the procedure was
run on a standard 333 MHz PC.

First, case 1 shows the CPU time needed for matching model A with the
data. It is very low. Second, case 2 shows the matching for an L-shaped building
(model B). The CPU time needed is still reasonable. Third, in case 3, the matching
for model C leads to a high cost (1 hour), which is not reasonable.

Thus, experiment 5 is a match of model C with 2 a priori vertices,
namely the 2 planar regions found during the match of model A: this external
information is helpful and makes the CPU time fall. One should notice
that the external information in the described case is not a user input but
data computed by a previous match. In case 6, we used the four planar faces
extracted in case 2, which leads to a very low computation time.

After the matching step, the shape of the building is fully reconstructed. The
results of the reconstructions of models B and C are presented in Fig. 5 (right).
Since model C is a refinement of model B, the use of external information during
the match seems useful in a coarse-to-fine approach.



Case  Model                            External information  CPU 1   CPU 2

1     partial gabled roof              none                  0.5 s
2     L building                       none                  5 s
3     L building with bevelled corner  none                  1 h
4     L building                       2 planar vertices     5 s     1.7 s
5     L building with bevelled corner  2 planar vertices     > 1 h   22 s
6     L building with bevelled corner  4 planar vertices     1 h     10 s

Table 1. CPU Times for the matching




Fig. 5. Left : The 3 tested models. Right : Reconstructions for model B and C. 



Future Work and Trends 

The objective of this research project is to select automatically the most suitable
model to describe a given scene. The simplest way would be to generate a
database of models through the use of rules, to match the models, and then
to select the most relevant one. However, once the models are generated, the
computation times would be too long, especially for the most complex
ones. For this reason, we start by partially matching the models, and then
we use the gathered information to extend the models, as shown in the previous
section.

Furthermore, the procedure shown here seems to be suitable for a coarse-to-fine
strategy: after having matched a coarse model (such as model B), a finer
model (such as model C) can be tested on the input data. The main interest here
is to avoid testing model C (which is more complicated) if model B is already
known to misfit.

5 Conclusion 

We have shown that the error-correcting subgraph isomorphism detection
procedures can be extended in an efficient way to integrate some external
information. This extension relies on an important change in the data structure involved
in the method. The proposed method tends to propagate the external
information as far as possible during the recursive construction of this structure.
Due to this propagation, the resulting matching procedure becomes considerably more
efficient than the naive one.

This framework is applied in the field of 3-D building reconstruction from
aerial stereo pairs, where it shows promising results, because it is possible to
match partial models and use the results of the matches to match more complex
models.



Abstract. A pyramid, i.e. a hierarchy of Region Adjacency Graphs (RAG), has only
one type of edge, “to be neighbor with”. We want to add some new types of
edge. The relation “to be inside” is already present as a special case of the
neighborhood relation. We detail the difference between a neighborhood
relation and an interior relation. Then we show that some new types of interior
relations are compatible with the process of building a pyramid.



1 Introduction 

The Region Adjacency Graph (RAG) has been studied for a long time. Some recent
works present different views of this topic, like the pyramid of dual graphs of
Kropatsch [1], or the discrete map of Braquelaire and Brun [2]. At GBR’99 [8],
Bunke et al. [4] presented work on type-n graphs, and Deruyver et al. [5]
on semantic graphs. We had many discussions about the possible
extensions of RAGs. [4] and [5] present different kinds of graphs with many
different types of edges.

We will present some topics of our discussion in Section 2. Then, in Section 3, we
will detail the structure of the RAG, followed by an answer for a limited case in
Section 4. We will end with some perspectives.



2 Discussion 

An image is a grid of pixels that may be seen as a graph. In this graph, the relation
between two nodes is the neighborhood of two pixels. We have only one relation, “to
be neighbor with”. In fact, with 4-connectivity we have four types of relations, “to be
neighbor on the top with”, “...on the bottom...”, “...on the left...”, “...on the
right...”, but we use them very rarely. The regularity of the grid gives us a kind of
metric on the image graph, especially when we use the four different relations. During
the merging process that extracts semantic information from the grid graph,
we use only the relation “to be neighbor with”. We have introduced the
relation “to be included in” for the case where, after a fusion, an object becomes
included in another one. In most works, the relation “to be included in” is managed
more as a trick than for itself. Indeed it is only a special case of “to be neighbor with”
and is represented by a loop edge in [1].



2.1 The type-n Graph

Bunke et al. [4] have proposed, in the field of graph matching, to split the
matching into 3 variants named type-0, type-1 and type-2, with an increasing
strictness of the matching.

Type-0 uses five relations: disjoint, adjacent, intersection, included, including.
Type-1 imposes that two objects have the same relation along each axis. These
relations along each axis define whether an object is on the top, on the bottom, on the
left, or on the right of another object. Type-2 adds 169 new relations: a set of 13
relations per axis (before, intersection, included, equal, adjacent, end, start, and the
opposite relations). We are able to follow this approach to build type-0, type-1 and
type-2 graphs.

A RAG is type-0, because it uses three of the five relations defined for a type-0
matching. Disjoint is not represented, adjacent is “to be neighbor with”, including is
the special case of “to be neighbor with”, included is the opposite of including, and the
relation intersection has no meaning in a RAG (a segmentation is a partition). The
grid graph is type-1. A type-2 graph may be defined; however, it is sometimes hard to
define the relations between two complex shapes.

It is possible to define a merging process to build a hierarchy of such graphs. We
just need the merging table. Given two objects A and B merged into C, and an object
X, the merging table gives us the relation CX from the relations AB, AX and BX.

Type-1 and type-2 relations carry some metric information. In the grid graph the
relation “to be neighbor with” is linked to “to be neighbor, by a distance of 1 pixel,
with”. The RAG has only one type of relation and includes no metric information.

Fig. 1. Orientation and type-2. On the left is the set of possible type-2 relations as in [4].
On the right are two drawings with their associated graphs. An algorithm is detailed in [4]
to extract the largest common sub-graph, and so to track an object in a known environment.
The main axis must be known. If we turn the drawing by 90°, the relation between the objects
stays the same, but the graph is very different.



For type-2, given two objects there are only nine possible relations (out of 13) per
axis, and seven if the two objects have the same size. There are thus 81, 63 or 49
possible relations (out of 169). We may see these 81 relations as a 9x9 grid around an
object in which we locate the other one. We do not have full metric information, but the
richness is far better than in the simple case “to be/not to be neighbor with”. The
type-2 relation allows us to define a graph with richer information than the RAG and the
grid graph. However, the choice of the axes is not a simple question, and we often need
some outside information. See Fig. 1.
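
The counts quoted above are simply products of the per-axis counts, which the following Python sketch enumerates. The constant and function names are illustrative; the reading of 63 as "same size along one axis only" is an interpretation of the text, not something stated explicitly.

```python
from itertools import product

ALL_PER_AXIS = 13         # relations defined per axis
DIFFERENT_SIZE = 9        # realisable per axis for two given objects
SAME_SIZE = 7             # realisable per axis when the extents are equal


def count_pairs(n_x, n_y):
    """Number of (x-axis relation, y-axis relation) combinations."""
    return len(list(product(range(n_x), range(n_y))))


print(count_pairs(ALL_PER_AXIS, ALL_PER_AXIS))      # all type-2 relations
print(count_pairs(DIFFERENT_SIZE, DIFFERENT_SIZE))  # sizes differ on both axes
print(count_pairs(DIFFERENT_SIZE, SAME_SIZE))       # same size on one axis
print(count_pairs(SAME_SIZE, SAME_SIZE))            # same size on both axes
```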



2.2 The Semantic Graph 

Another way to analyze an image is the partial matching between the image and
some already segmented regions of an a priori model embedded in a semantic graph.
Semantic graphs embed two important kinds of information: the structural information
and the contextual information. The structural information is defined by the unary
constraints of each node. These constraints define, for example, the topology of the
region. The contextual information is defined by the binary relations of the edges.
They define, for example, the spatial relation of two regions. These graphs are well
suited to encoding complex constraints and they allow us to compute the influence of
one node over the whole graph. See Fig. 2.




[Figure 2: recoverable legend fragments. Edge labels: right-left, under-behind.
Nodes: 10: right substantia nigra; 11: cerebral trunk; 12, 13: undifferentiated
tissue; 14: septum lucidum.]



Fig. 2. Semantic Graph. Example of a semantic graph built from a human brain description.
We have four types of relations. This graph is reproduced from [9].



2.3 RAG Weakness 

After this short presentation of two types of graph that have a richer set of edge types
than the RAG, we will present some weaknesses of the RAG. See Fig. 3. The RAG
cannot distinguish the arrangements of the included regions, whereas from the point of
view of human perception there is a big difference.

In [3], we have shown that the theories [1,2,6] are equivalent. We will not detail
the addition of new edges for each of these theories. In addition, it has been shown that
we can represent the pyramid as a labeled graph [7]. Therefore, we have a smaller
structure to encode the multi-level segmentation, and an easy merging process. For
all these reasons, we want to enhance the RAG by the introduction of new edge types.
However, is it possible to extend the types of edge in a RAG-like graph, while
preserving the ease of building the hierarchy?




Fig. 3. At left, the 5 objects included in the central region are the same; only their relative
positions change. No RAG-like structure can represent this kind of difference. The two other
cases are further examples where, from a human point of view, there is a big difference between
the two drawings, but the RAG cannot capture it.




Fig. 4. (a) A partition, with the face graph, i.e., the graph of the elements’ borders. (b) A
simple region adjacency graph. (c) A RAG with double edges (the thick edges). We
must add multiple edges to be able to recover the face graph by duality when the
common border of two elements is not connected. (d) A RAG with multiple edges and
a loop (the thick edge). We must add a loop to be able to recover the face graph by
duality when an element of the partition is included in another.



3 Region Adjacency Graph 



A Region Adjacency Graph is a graph built on a partition of the 2D space where each
node is an element of the partition, and each edge is an adjacency relation between
two elements. Fig. 4b shows a simple Region Adjacency Graph (RAG)
obtained from the partition in Fig. 4a.
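
As an illustration of this definition, the following Python sketch builds a simple RAG (no multiple edges and no loops, as in Fig. 4b) from a small label image. The function name `build_rag`, the list-of-lists label image and the use of 4-connectivity between pixels are assumptions made for the example.

```python
def build_rag(labels):
    """Build a simple region adjacency graph from a 2D label image.

    labels is a list of rows; each entry is the id of the region the pixel
    belongs to. Returns a dict mapping each region id to the set of region
    ids that share at least one 4-connected pixel border with it.
    """
    rag = {}
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            rag.setdefault(a, set())
            # Looking right and down is enough to visit every pixel pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = labels[rr][cc]
                    if a != b:
                        rag[a].add(b)
                        rag.setdefault(b, set()).add(a)
    return rag


# Region 3 is a single pixel lying between regions 1 and 2.
labels = [
    [1, 1, 2],
    [1, 3, 2],
    [1, 1, 2],
]
rag = build_rag(labels)
print(rag)
```

Because adjacency is stored as a set, a non-connected common border or an inclusion leaves no trace in this structure, which is exactly the limitation that the multiple edges and loops of Fig. 4c and 4d address.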



3.1 Two Types of Edges in RAG 

When we merge some nodes, two new situations appear. Two extensions have been
developed: one for when two elements have a non-connected common border (Fig. 4c),
and one for when a region is surrounded by another (Fig. 4d).

In a RAG or in the Frontier-Region Graph [6], there is only one type of edge, “to be
neighbor with”. However, when a region is surrounded by another, the relation
between the inside and the outside region is more than “to be neighbor with”: it
becomes “to be included in”. In the Discrete Map, the relation “to be included in” is
encoded in a separate structure, an inclusion tree defined by a function that returns for
each element its father, i.e. the surrounding one.

We call the edges “to be neighbor with” neighborhood edges and the edges “to
be inside of” interior edges.



3.2 Transition of Types 



By merging the nodes, we build a pyramid: a hierarchy of graphs. During a merge of
two nodes, the only transition that may occur is a pair of neighborhood edges
(Fig. 5a) becoming an interior edge (Fig. 5b).




Fig. 5. Merge of a graph that produces an interior edge (the thick one). 



We will now give a few mathematical details. The support S of an image is, in a
generic way, a finite partition of the 2D Euclidean space with only one infinite
element, the background.

Let A, B and C be three elements of the partition. We note:

$A \leftrightarrow B$ when A is a neighbor of B (but not included in B). (1)

$A \rightarrow B$ when A is included in B. (2)

$B \oplus C$ the merge of B and C, i.e., $B \cup C$ when B and C are adjacent. (3)

Fig. 6 presents 4 drawings of the cases that appear during a merge; they are
examples of relations (4), (5) and (6).

Fig. 6. (a) and (b) show an example for (4), (c) for (5) and (d) for (6). For each drawing, the
top shows a partition example and the bottom the corresponding graph.

We have the following relations (Fig. 6):

$A \leftrightarrow B$ and $B \leftrightarrow C \Rightarrow A \leftrightarrow B \oplus C$ or $A \rightarrow B \oplus C$. (4)

$A \rightarrow B$ and $B \leftrightarrow C \Rightarrow A \rightarrow B \oplus C$. (5)

$A \rightarrow B$ and $A \leftrightarrow C \Rightarrow A \oplus C \rightarrow B$ (and $C \rightarrow B$ before the merge). (6)

In (6), if A is included in B, and we merge A and C, C must be included in B. For
the topological interior, this is obvious, but we will see some other possibilities.



4 New Interior Edge 



In everyday usage, when we say that something is inside something else, we mean more
than the interior edge discussed above, which uses only a topological meaning of
interior. In Fig. 7, we present three different cases where A is in B.



Fig. 7. A is the dot, B is the ellipse. (a) A is inside B in the topological way, (b) in a
“mechanical” way, and (c) in an “optical” way.

The topological interior relation has already been studied for years. We will study
the third one, the optical interior relation. A is “optically” inside B if A is
topologically inside the smallest convex shape surrounding B. A is “mechanically”
inside B if A cannot be moved outside B. We will further extend the result to the
“mechanical” relation.

Let A, B be two elements of the partition. We note:

$A \vee B$ if A is “optically” in B, and A is not included in B. (7)

Let $\mathrm{Cvx}(A)$ be the smallest convex shape surrounding A. (8)

We have: $A \vee B \Leftrightarrow A \rightarrow \mathrm{Cvx}(B)$. (9)

We want only non-ambiguous relations. We will say that A is “optically” in B only
if A is not already included topologically in B.
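
The optical interior test can be sketched as follows in Python. This is a simplified illustration under several assumptions that are not in the text: regions are given as sets of pixel centres treated as points, the smallest convex shape is approximated by the convex hull of these points (computed with Andrew's monotone chain algorithm), and the caller tells the function whether A is already topologically included in B. All function names are hypothetical.

```python
def convex_hull(points):
    """Convex hull by Andrew's monotone chain, in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_convex(hull, p):
    """True if p is inside or on the border of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        if (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) < 0:
            return False
    return True


def optically_inside(region_a, region_b, topologically_inside=False):
    """A is optically in B if every point of A lies in the convex hull of B,
    unless A is already topologically included in B."""
    if topologically_inside:
        return False
    hull = convex_hull(region_b)
    return all(inside_convex(hull, p) for p in region_a)


# B is a "C" shape open to the right; A sits in its opening.
B = {(0, 0), (1, 0), (2, 0), (0, 1), (0, 2), (1, 2), (2, 2)}
A = {(1, 1)}
print(optically_inside(A, B))         # A lies in the hull of B
print(optically_inside({(5, 5)}, B))  # a far-away pixel does not
```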

4.1 Merging Process 

First, suppose we have a RAG with three types of edges: the interior one and the
neighborhood one as defined in (1), and a third one, the “optical” interior one. We will
study whether this kind of edge is compatible with (1). We will then study the creation
of such edges.

Fig. 8 shows the new cases that appear during a merge. We have the same
relations (4) and we have the following new relation:

$A \vee B$ and $B \leftrightarrow C \Rightarrow A \vee B \oplus C$ (Fig. 8a). (10)

Proof. $A \vee B \Rightarrow A \rightarrow \mathrm{Cvx}(B)$, $C \subset \mathrm{Cvx}(C)$, $A \rightarrow \mathrm{Cvx}(B) \oplus \mathrm{Cvx}(C) \subset \mathrm{Cvx}(B \oplus C)$. ♦




But we cannot directly say anything about $A \oplus C$ and B. See Fig. 8b. In this case, we
do not necessarily have $C \vee B$ before the merge. However, we have:

$A \vee B$ and $C \vee B$ and $A \leftrightarrow C \Rightarrow A \oplus C \vee B$ (11)

$A \vee B$ and $B \rightarrow C \Rightarrow A \vee B \oplus C$ (12)

Proof. $A \vee B \Rightarrow A \rightarrow \mathrm{Cvx}(B) \subset \mathrm{Cvx}(C)$, $A \vee C$. ♦

$A \vee B$ and $C \rightarrow B \Rightarrow A \vee B \oplus C$ (13)

$A \vee B$ and $C \rightarrow A \Rightarrow A \oplus C \vee B$ (14)

We have the same problem for $A \vee B$ and $A \rightarrow C$: we do not necessarily have
$C \vee B$, as in Fig. 8b. See Fig. 8c for an example. However, we have:

$A \vee B$ and $C \vee B$ and $A \rightarrow C \Rightarrow A \oplus C \vee B$ (15)



From an algorithmic point of view, we need to check two new conditions before doing
a merge, to verify that both merged regions are included in the same one when $A \vee B$
and $A \rightarrow C$ or $A \leftrightarrow C$. If the condition is not verified, we must suppress the
edge $A \vee B$ during the merge.
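
The check described above can be sketched as a small Python function that updates the optical interior edges when two regions are merged. The data layout (sets of ordered pairs), the function name and its parameters are assumptions made for the illustration; the rule numbers in the comments refer to relations (10) to (15).

```python
def merge_with_optical_edges(optical, interior, a, c, merged):
    """Merge regions a and c into `merged` and update the optical edges.

    optical and interior are sets of pairs (x, y) meaning "x is optically
    in y" and "x is topologically included in y". Returns the new set of
    optical edges and the list of edges that had to be suppressed.
    """
    new_optical, suppressed = set(), []
    for x, y in optical:
        if {x, y} == {a, c}:
            continue  # edge between the merged regions: it vanishes
        if y in (a, c):
            # The container is merged: rules (10), (12), (13).
            new_optical.add((x, merged))
        elif x in (a, c):
            other = c if x == a else a
            # Keep the edge if the other region is optically in the same
            # container (rules (11), (15)) or is included in x (rule (14)).
            if (other, y) in optical or (other, x) in interior:
                new_optical.add((merged, y))
            else:
                suppressed.append((x, y))
        else:
            new_optical.add((x, y))
    return new_optical, suppressed


# A is optically in B, C is not: merging A and C suppresses the edge.
print(merge_with_optical_edges({("A", "B")}, set(), "A", "C", "AC"))
# Both A and C are optically in B: the edge survives the merge.
print(merge_with_optical_edges({("A", "B"), ("C", "B")}, set(), "A", "C", "AC"))
```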

We now have a graph with three types of edges: neighborhood, topological
interior and optical interior. With a small addition, the merging process stays the
same. An important change is that during a merge an edge that is not a member of the
merged set may be suppressed. This is a new side effect: before, the only edges removed
from the pyramid were those in the merged set, i.e. the set of edges connecting two
vertices that are merged during the merging process.

Even if this change is important for the concept of the pyramid, it has little impact
on the overall structure, because there is no transition between topological interior
edges and optical interior ones. A merge may occur only between two neighboring
vertices; we cannot merge along an optical interior edge.

The optical interior edges are on another level: they are created by user interaction,
and stay in the structure without truly modifying it. Sometimes they are suppressed to
keep the structure coherent. However, another way exists.



4.2 New Region 

We will now relax the constraint on the type of edge used in a merging process. We
want to allow the merge to occur between any two vertices linked by an edge, even if
this edge is an optical interior one. This breaks some topological properties: the
segmentation obtained in this way will not remain a partition in the above sense,
because some elements may not be connected.

For most of the new cases, the merge is possible without any trouble.



$A \mathbin{U} B$ and $B \vee C \Rightarrow A \mathbin{U} B \oplus C$ (16)

$A \vee B$ and $B \vee C \Rightarrow A \vee B \oplus C$ or $B \vee A \oplus C$ or $C \vee A \oplus B$ (17)

$A \vee B$ and $C \vee B \Rightarrow A \vee B \oplus C$ or $B \vee A \oplus C$ or $C \vee A \oplus B$ (18)

$A \vee B$ and $A \vee C \Rightarrow A \vee B \oplus C$ (19)



A problem arises only when $A \vee B$ and $A \vee C$ and we want to merge A with B or C.
Fig. 9 shows an example of this new situation. We must have $B \vee C$ or $C \vee B$ to
merge A without losing an edge. This case has already been studied in (17). The new
situation of Fig. 9 is similar to that of Fig. 8c. The merging process stays simple, but
in two situations (Fig. 9 and Fig. 8c) one edge must disappear.

Fig. 9. When $A \vee B$ and $A \vee C$, we cannot merge A with B or C without breaking one edge.

For the mechanical interior edge, the merging process is much the same as for
the optical interior edge. We do not detail it.



5 Conclusion 

By relaxing the connectivity constraint, we obtain a new definition of
segmentation. The constraint is not fully removed; it is moved from the structure to
the merging process. We have shown that some extensions of the edge types
change the merging process only a little. During the merging process most of the
merges will occur when $A \rightarrow B$ or $A \leftrightarrow B$, but when an optical interior edge
becomes meaningful, we add it to the structure. The merging process may then follow two
different ways: the classic one (Sec. 4.1), where only adjacent elements may be
merged, or the new one (Sec. 4.2), where any edge may be merged and some
non-connected regions may appear. The representation of non-connected regions may be a
point of interest. For example, in Fig. 3 (left), the ellipse is optically in the set of circles,
rectangles and triangles. We have a classic subjective region in Fig. 10.







Fig. 10. Subjective region as an example of possible use of optical interior edge. 

After a discussion about the interest of adding some new types of edges to a
RAG-like structure, we have presented an extension of the interior relation. This
extension was made with a minimum of changes to the merging process. This point of
view has allowed us to consider some new pyramids, i.e., without the connectivity
constraint but with the same merging process. The two next steps will be to study the
extraction of these new edge types and, on a more theoretical side, to study the
properties that edges in general must respect to be compatible with this merging process.



Abstract. An algorithm for learning structural patterns given in terms of
Attributed Relational Graphs (ARG’s) is presented. The algorithm, based on
inductive learning methodologies, produces general and coherent prototypes in
terms of Generalized Attributed Relational Graphs (GARG’s), which can be
easily interpreted and manipulated. The learning process is defined in terms of
inference operations especially devised for ARG’s, such as graph generalization and
graph specialization, thus making it possible to reduce both the
computational cost and the memory requirement of the learning process.
Experimental results are presented and discussed with reference to a structural
method for recognizing characters extracted from the ETL database.



1 Introduction 

Structured patterns are patterns represented in terms of simple parts, often called
primitives, and relations among them. They are generally represented by means of
Attributed Relational Graphs (ARG’s), e.g. by associating the nodes and the edges
respectively to the primitives and to the relations among them. If necessary, their
properties are represented by attributes of both the nodes and the edges.

Despite their attractiveness in terms of representational power, structural methods
(i.e. methods dealing with structured information) imply complex procedures both in
the recognition and in the learning process. In fact, in real applications the
information is affected by distortions, and consequently the corresponding graphs
turn out to be very different from the ideal ones. So, in the recognition stage the
comparison between the input sample and a set of prototype graphs cannot be
performed by exact graph matching procedures. Moreover the learning problem,
i.e. the task of building a set of prototypes adequately describing the objects of each
class, is complicated by the fact that the prototypes, implicitly or explicitly, should
include a model of the possible distortions. For these reasons the problem
is still under investigation: many of the approaches proposed in recent years
consider this task as a symbolic machine learning problem, introducing description
languages often more general and complex than needed.

The advantage that makes this approach really effective lies in the descriptions
obtained: since the learning method is oriented toward the construction of
maximally general prototypes, they are expressed as compact predicates, easily
understandable by human beings. The user can acquire knowledge about the domain
by looking at them and can consequently validate or even improve the prototypes,
or at least understand what has gone wrong in case of classification errors. On the
other hand these methodologies are so computationally heavy, both in terms of time
and memory requirements, that only simple applications can actually be dealt with.

Our approach is similar to these methods, but has the peculiarity that descriptions
are given in terms of Attributed Relational Graphs. The use of a somewhat less
powerful description language is compensated by the ability to express the operations
of our learning method directly in terms of graph operations, with a significant
improvement in the computational requirements of the system. To this aim, we will
introduce a new kind of ARG, called Generalized Attributed Relational Graph,
designed to represent in a compact way the features common to a set of ARG’s. Then,
in Section 3 we will formulate a learning algorithm operating directly in the space of
graphs: it finds prototypes that are both general and consistent, like classical
machine learning ones. Section 4 reports an experimental analysis of the method with
reference to a problem of character recognition, using a standard character database.



2 Preliminary Definitions 

An ARG can be defined as a 6-tuple $(N, E, A_N, A_E, \alpha_N, \alpha_E)$, where $N$ and
$E \subseteq N \times N$ are respectively the sets of the nodes and of the edges of the ARG,
$A_N$ and $A_E$ the sets of node and edge attributes, and finally $\alpha_N$ and $\alpha_E$ the
functions which associate to each node or edge of the graph the corresponding attribute.

We will assume that the attributes of a node or an edge are expressed in the form
$t(p_1, \ldots, p_{k_t})$, where $t$ is a type chosen over a finite alphabet $T$ of possible
types and $(p_1, \ldots, p_{k_t})$ is a tuple of parameters, also from finite sets
$P_1^t, \ldots, P_{k_t}^t$. Both the number of parameters ($k_t$, the arity associated to
type $t$) and the sets they belong to depend on the type of the attribute, so that we are
able to differentiate the descriptions of different kinds of nodes (or edges), as
explained in fig. 1.

Let us introduce the concept of Generalized Attributed Relational Graph (from
now on GARG). Basically a GARG is an ARG with an extended attribute definition:
the set of types of node and edge attributes is extended with the special type $\phi$,
carrying no parameter and matching any attribute type, regardless of the attribute
parameters. For the other attribute types, if the sample has a parameter whose value is
within the set $P_i^t$, the corresponding parameter of the prototype belongs to the set
$P_i^{*t} = \wp(P_i^t)$, where $\wp(X)$ is the power set of $X$, i.e. the set of all the
subsets of $X$.

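We say that a GARG $G^* = (N^*, E^*, A_N^*, A_E^*, \alpha_N^*, \alpha_E^*)$ covers a sample
$G$ ($G^* \models G$, where the symbol $\models$ denotes the relation from now on called
covering) iff there is a mapping $\mu: N^* \to N$ such that:

1. $\mu$ is a monomorphism, that is:

$$n_1^* \neq n_2^* \Rightarrow \mu(n_1^*) \neq \mu(n_2^*); \qquad \forall (n_1^*, n_2^*) \in E^*,\ (\mu(n_1^*), \mu(n_2^*)) \in E \qquad (1)$$

2. the attributes of the nodes and of the edges of $G^*$ are compatible with the
corresponding ones of $G$; that is:

$$\forall n^* \in N^*,\ \alpha_N^*(n^*) \succ \alpha_N(\mu(n^*)); \qquad \forall (n_1^*, n_2^*) \in E^*,\ \alpha_E^*(n_1^*, n_2^*) \succ \alpha_E(\mu(n_1^*), \mu(n_2^*)) \qquad (2)$$

where the symbol $\succ$ denotes a compatibility relation, defined as follows:

$$\forall t,\ \phi \succ t(p_1, \ldots, p_{k_t}); \qquad \forall t,\ t(P_1^*, \ldots, P_{k_t}^*) \succ t(p_1, \ldots, p_{k_t}) \Leftrightarrow p_1 \in P_1^* \wedge \ldots \wedge p_{k_t} \in P_{k_t}^* \qquad (3)$$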

[Figure 1. Panel a): the object. Panel b): the corresponding graph, with node labels such
as circle(m). Panel c): the type alphabets. Node types: T = { rectangle, circle }, with
k_rectangle = 2 and k_circle = 1; the circle parameter takes values in {s, m, l}. Edge
types: T = { on_top }, with k_on_top = 0.]

Fig. 1. a) An object made of two different kinds of primitives (circles and rectangles) and b)
the corresponding graph. c) The type alphabets. The description scheme defines two types of
nodes, each associated to a different primitive. Each type contains a set of parameters to
suitably describe a component of that type (s, m, l stand for small, medium, large, respectively).
Similarly, edges of the graph describe topological relations among the primitives.



Condition (1) requires that each primitive and each relation in the prototype must
be present also in the sample, while the converse condition does not hold; this allows
the prototype to specify only the features which are strictly required for
discriminating among the various classes, neglecting the irrelevant ones. Condition
(2) constrains the monomorphism required by condition (1) to be consistent with the
attributes of the prototype and of the sample: the compatibility relation defined in (3)
simply states that the type of the attribute of the prototype must be either equal to
$\phi$ or to the type of the corresponding attribute of the sample; in the latter case each
parameter of the prototype attribute (which is actually a set of values) must contain
the value of the corresponding parameter of the sample.
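The covering test of conditions (1)-(3) can be illustrated with a small sketch. It is only a brute-force check that tries every injective node mapping, so its cost is exponential; it is not the fast matching algorithm referred to later in the paper. The dictionary-based graph layout and the names `PHI`, `compatible` and `covers` are assumptions made for this illustration.

```python
from itertools import permutations

PHI = None  # stands for the special type phi, which matches any attribute


def compatible(proto_attr, sample_attr):
    """Compatibility relation of eq. (3).

    proto_attr is PHI or (type, tuple of sets of admitted values);
    sample_attr is (type, tuple of parameter values).
    """
    if proto_attr is PHI:
        return True
    p_type, p_sets = proto_attr
    s_type, s_values = sample_attr
    return p_type == s_type and all(v in admitted for v, admitted in zip(s_values, p_sets))


def covers(proto, sample):
    """Return True iff some injective node mapping satisfies conditions (1) and (2)."""
    p_nodes = list(proto["nodes"])
    # Every injective mapping mu: N* -> N is tried in turn (exponential cost).
    for image in permutations(sample["nodes"], len(p_nodes)):
        mu = dict(zip(p_nodes, image))
        nodes_ok = all(
            compatible(proto["nodes"][n], sample["nodes"][mu[n]]) for n in p_nodes
        )
        edges_ok = all(
            (mu[a], mu[b]) in sample["edges"]
            and compatible(attr, sample["edges"][(mu[a], mu[b])])
            for (a, b), attr in proto["edges"].items()
        )
        if nodes_ok and edges_ok:
            return True
    return False


# A sample in the spirit of Fig. 1: a medium circle on top of a rectangle.
sample = {
    "nodes": {1: ("circle", ("m",)), 2: ("rectangle", ("s", "l"))},
    "edges": {(1, 2): ("on_top", ())},
}
# Prototype: a small or medium circle on top of anything.
proto_ok = {
    "nodes": {"a": ("circle", (frozenset({"s", "m"}),)), "b": PHI},
    "edges": {("a", "b"): PHI},
}
# Prototype: a large circle; it does not cover the sample.
proto_bad = {"nodes": {"a": ("circle", (frozenset({"l"}),))}, "edges": {}}

print(covers(proto_ok, sample), covers(proto_bad, sample))
```

Note that the sample may contain nodes and edges that the prototype never mentions, which is exactly why the converse of condition (1) is not required.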

Another important relation that will be introduced is specialization (denoted by the
symbol $<$): a prototype $G_1^*$ is said to be a specialization of $G_2^*$ iff:

$$G_1^* < G_2^* \Leftrightarrow \forall G,\ G_1^* \models G \Rightarrow G_2^* \models G \qquad (4)$$

In other words, a prototype $G_1^*$ is a specialization of $G_2^*$ if every sample covered
by $G_1^*$ is necessarily covered by $G_2^*$ too. Hence, a more specialized prototype imposes
stricter requirements on the samples to be covered.

Notice that the specialization relation introduces a non-total ordering in the prototype
space, whose minimum (if we consider only non-empty graphs) is the GARG
having only one node with attribute $\phi$: any non-empty GARG is a specialization of it.



3 The Learning Algorithm 

The goal of the learning algorithm can be stated as follows: the algorithm is given a
training set $S$ of labeled patterns, partitioned into $C$ different classes
($S = S_1 \cup \ldots \cup S_C$ with $S_i \cap S_j = \emptyset$ for $i \neq j$), from which it
tries to find a sequence of prototype graphs $G_1^*, G_2^*, \ldots, G_p^*$, each labeled with
a class identifier, such that:

1. $\forall G \in S\ \exists i: G_i^* \models G$ (completeness of the prototype set) (5)

2. $\forall G \in S,\ G_i^* \models G \Rightarrow class(G) = class(G_i^*)$ (consistency of the prototype set) (6)

where $class(G)$ and $class(G_i^*)$ refer to the class associated with the sample $G$ and
with the prototype $G_i^*$ respectively.

Equations (5) and (6) would be trivially satisfied by defining a prototype for each
sample in S. However, such a trivial solution requires a number of prototypes which
could be too large for many applications; besides, in a complex domain it is difficult to
obtain a training set which exhaustively covers all the possible instances of a class.
Hence, for eq. (5) the prototypes generated should also be able to model samples not
found in S, that is, they must be more general than the enumeration of the samples in
the training set. However, they should not be too general, otherwise eq. (6) will not be
satisfied. Achieving the optimal trade-off between completeness and
consistency makes prototype generation a really hard problem.

In this respect, our definition of the covering relation, which allows the sample to
have nodes and edges not present in the prototypes, is aimed at increasing the
generality of the prototypes; in fact, each prototype must specify only the distinctive
features of a class, i.e. the ones which allow the samples of the class to be distinguished
from those of other classes; optional features are left out of the prototype, and their
presence or absence has no effect on the classification.

It’s worth pointing out that in our definition of GARG’s there is no possibility of
expressing negation; this fact allows our method to employ a fast graph matching
algorithm suitable to verify whether a prototype covers a sample, instead of
the usual unification algorithm. On the other hand, this limitation reduces the
expressiveness of the prototypes. In order to deal with situations in which the patterns
of some class can be viewed as subpatterns of another class, a sample is compared
sequentially against the prototypes in the same order in which they have been
generated, and it is attributed to the class of the first prototype that covers it. One of
the strengths of the proposed learning method is the automatic handling of these
situations, by adopting a learning strategy which considers simultaneously all the
classes that must be learned in order to determine (without hints from the user) the
proper ordering for the prototypes.

A sketch of the algorithm is shown by the following code, where $S(G^*)$ denotes
the set of all the samples of the training set covered by a prototype $G^*$, and $S_i(G^*)$
the samples of the class $i$ covered by $G^*$.



A sketch of the learning procedure

FUNCTION Learn(S)                     // Returns ordered list of prototypes
    L := [ ]                          // L is list of prototypes, initially empty
    WHILE S ≠ ∅
        G* := FindPrototype(S)
        IF NOT Consistent(G*) THEN
            FAIL                      // Algorithm terminates unsuccessfully
        END IF
        // Assign prototype to the class most represented
        class(G*) := argmax_i |S_i(G*)|
        L := Append(L, G*)            // Add G* to the end of L
        S := S - S(G*)                // Remove the covered samples from S
    END WHILE
    RETURN L
END FUNCTION



It is worth pointing out that the test of consistency in the algorithm actually checks
whether the prototype is almost consistent:

$$Consistent(G^*) \Leftrightarrow \max_i \frac{|S_i(G^*)|}{|S(G^*)|} \geq \theta \qquad (7)$$

In eq. (7) $\theta$ is a threshold close to 1, used to adapt the tolerance of the algorithm to
slight inconsistencies in order to have a reasonable behavior also on noisy training
data. For example, with $\theta = 0.95$ the algorithm would consider a prototype consistent
if at least 95% of the covered training samples belong to the same class, avoiding a
further specialization of this prototype that could be detrimental for its generality.
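A minimal sketch of the test in eq. (7) follows. It assumes that the classes of the samples covered by a prototype are available as a plain list of labels; the function name `is_consistent` is hypothetical, and treating an empty coverage as inconsistent is an assumption chosen to match the failure branch of the Learn procedure.

```python
from collections import Counter


def is_consistent(covered_labels, theta=0.95):
    """Eq. (7): the dominant class must account for at least theta of the covered samples."""
    if not covered_labels:
        return False  # assumption: a prototype covering nothing is not consistent
    dominant = max(Counter(covered_labels).values())
    return dominant / len(covered_labels) >= theta


# 96 of 100 covered samples share a class: accepted with theta = 0.95.
print(is_consistent(["E"] * 96 + ["F"] * 4))
# Only 94 of 100 do: the prototype would be specialized further.
print(is_consistent(["E"] * 94 + ["F"] * 6))
```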

Note that the assignment of a prototype to a class is done after the prototype has
been found, meaning that the prototype is not constructed in relation to an a priori
determined class: the algorithm finds at each step the class which can be better
covered by a prototype and generates a prototype for it. In this way, if the patterns of
a class i can be viewed as subpatterns of samples of another class j (e.g. the graphs
describing the character ‘F’ are often subgraphs of those representing character ‘E’),
the algorithm will cover first the class i and then the class j; in this case, we say that
the prototypes of the class i have precedence over those of class j.

The most important part of the algorithm is the FindPrototype procedure, which
performs the construction of a prototype, starting from the trivial GARG (which
covers any non-empty graph) and refining it by successive specializations until either
it becomes consistent or it covers no samples at all. The FindPrototype algorithm is
greedy, in the sense that at each step it chooses the specialization that seems to be the
best one, looking only at the current state without any form of look-ahead. This search
is guided by the heuristic function H, which will be examined later.



The function FindPrototype

FUNCTION FindPrototype(S)             // Finds the best prototype covering one class in S
    G* := TrivialPrototype            // Only one node with attr. φ
    WHILE |S(G*)| > 0 AND NOT Consistent(G*)
        Q := Specialize(G*)
        G* := argmax_{X ∈ Q} H(S, X)  // H is heuristic function
    END WHILE
    RETURN G*
END FUNCTION



3.1. The Heuristic Function

The heuristic function $H$ is introduced for evaluating how promising a provisional
prototype is. It is based on the estimation of the consistency and completeness of the
prototype (see eq. 5 and 6):

$$H(S, G^*) = H_{compl}(S, G^*) \cdot H_{cons}(S, G^*) = |S(G^*)| \cdot \bigl(I(S) - I(S(G^*))\bigr) \qquad (8)$$

where

$$I(S) = -\sum_i \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \quad \text{and} \quad I(S(G^*)) = -\sum_i \frac{|S_i(G^*)|}{|S(G^*)|} \log_2 \frac{|S_i(G^*)|}{|S(G^*)|} \qquad (9)$$

In other words, to evaluate the consistency degree of a provisional prototype $G^*$,
we have used the quantity of information (in bits) necessary to express the class a
given element of $S(G^*)$ belongs to, i.e. $I(S(G^*))$; the completeness of $G^*$, instead, is
taken into account by simply counting the number of samples covered by $G^*$, thus
preferring general prototypes over more specialized ones.
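The interplay of Learn, FindPrototype, the consistency test and the heuristic H can be tried out on a deliberately simplified domain. In the sketch below, which is not the authors' implementation, a sample is a set of features, a prototype is a set of required features, covering is set inclusion (so the sample may have extra features, as with GARGs), and the only specialization operator adds one required feature. All function names, the toy data and the choice of scoring a prototype that covers nothing as minus infinity are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """I(.) of eq. (9): bits needed to express the class of an element."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def covered(proto, samples):
    """S(G*): the (features, label) pairs whose features include the prototype."""
    return [(x, y) for x, y in samples if proto <= x]


def heuristic(samples, proto):
    """H of eq. (8); a prototype covering nothing is never preferred (assumption)."""
    cov = covered(proto, samples)
    if not cov:
        return float("-inf")
    return len(cov) * (entropy([y for _, y in samples]) - entropy([y for _, y in cov]))


def consistent(cov, theta):
    """Eq. (7) applied to a list of covered samples."""
    return bool(cov) and max(Counter(y for _, y in cov).values()) / len(cov) >= theta


def find_prototype(samples, features, theta):
    proto = frozenset()  # trivial prototype: covers every sample
    while covered(proto, samples) and not consistent(covered(proto, samples), theta):
        candidates = [proto | {f} for f in sorted(features - proto)]
        if not candidates:
            break
        proto = max(candidates, key=lambda q: heuristic(samples, q))  # greedy step
    return proto


def learn(samples, theta=1.0):
    features = frozenset().union(*(x for x, _ in samples))
    remaining, prototypes = list(samples), []
    while remaining:
        proto = find_prototype(remaining, features, theta)
        cov = covered(proto, remaining)
        if not consistent(cov, theta):
            raise ValueError("no consistent prototype found")
        label = Counter(y for _, y in cov).most_common(1)[0][0]
        prototypes.append((proto, label))
        remaining = [s for s in remaining if s not in cov]
    return prototypes


def classify(prototypes, x):
    """Attribute x to the class of the first prototype that covers it."""
    for proto, label in prototypes:
        if proto <= x:
            return label
    return None


training = [
    (frozenset({"loop", "stem"}), "9"),
    (frozenset({"loop", "stem", "tail"}), "9"),
    (frozenset({"loop"}), "0"),
    (frozenset({"loop", "dot"}), "0"),
]
prototypes = learn(training)
print(prototypes)
```

On this toy data the first prototype requires only the feature "stem" and is assigned to class "9"; once those samples are removed, the trivial prototype is already consistent for class "0". Since negation cannot be expressed, the ordering of the list matters at classification time.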



3.2. The Specialization Operators 

An important step of the FindPrototype procedure is the construction of a set Q of
specializations of the tentative prototype $G^*$. At each step, the algorithm tries to
refine the current prototype definition, in order to make it more consistent, by
replacing the tentative prototype with one of its specializations. To accomplish this
task we have defined the following set of specialization operators which, given a
prototype graph $G^*$, produce a new prototype $\tilde{G}^*$ such that $\tilde{G}^* < G^*$:

1. NODE ADDITION: $G^*$ is augmented with a new node $n$ whose attribute is $\phi$. This
operator is always applicable.

2. EDGE ADDITION: a new edge $(n_1^*, n_2^*)$ is added to the edges of $G^*$, where $n_1^*$
and $n_2^*$ are nodes of $G^*$ and $G^*$ does not already contain an edge between them.
The edge attribute is $\phi$. This operator is applicable if $G^*$ is not a complete graph.

3. ATTRIBUTE SPECIALIZATION: the attribute of a node or an edge is specialized
according to the following rule:

• If the attribute is $\phi$, then a type $t$ is chosen and the attribute is replaced with
$t(P_1^t, \ldots, P_{k_t}^t)$. This means that only the type is fixed, while the type parameters
can match any value of the corresponding type.

• Else, the attribute takes the form $t(P_1^*, \ldots, P_{k_t}^*)$, where each $P_i^*$ is a (not
necessarily proper) subset of $P_i^t$. One of the $P_i^*$ such that $|P_i^*| > 1$ is replaced
with $P_i^* - \{p_i\}$, where $p_i \in P_i^*$. In other words, one of the possible values of a
parameter is excluded from the prototype.



Note that, except for the node addition, the specialization operators can usually be
applied in several ways to a prototype graph; for example, the edge addition can be
applied to different pairs of nodes. In these cases, it is intended that the function
Specialize generates all the possibilities.



The function Specialize

FUNCTION Specialize(G*)               // Returns the set of the direct specializations of G*
    Q := ∅
    FOREACH o IN SpecializationOperators
        IF Applicable(o, G*) THEN
            Q := Q ∪ Apply(o, G*)
        END IF
    END FOREACH
    RETURN Q
END FUNCTION
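The three operators can be sketched on a small dictionary-based GARG representation. This is an illustration under stated assumptions rather than the authors' code: nodes carry integer identifiers, edges are treated as directed pairs, `None` plays the role of the type φ, and the type alphabets are passed in as dictionaries mapping each type to the tuple of its full parameter sets. The names `specialize_attr` and `specialize` are hypothetical.

```python
import copy
from itertools import permutations

PHI = None  # the special type phi


def specialize_attr(attr, alphabet):
    """Yield every direct specialization of a single attribute (operator 3)."""
    if attr is PHI:
        # Fix only the type: every parameter keeps its full set of values.
        for t, full_sets in alphabet.items():
            yield (t, tuple(full_sets))
    else:
        t, sets = attr
        for i, admitted in enumerate(sets):
            if len(admitted) > 1:
                for value in sorted(admitted):
                    # Exclude one admitted value of the i-th parameter.
                    yield (t, sets[:i] + (admitted - {value},) + sets[i + 1:])


def specialize(g, node_alphabet, edge_alphabet):
    """Return the list of all direct specializations of the prototype g."""
    out = []
    # 1. Node addition: always applicable.
    h = copy.deepcopy(g)
    h["nodes"][max(g["nodes"], default=-1) + 1] = PHI
    out.append(h)
    # 2. Edge addition: one candidate per ordered pair of nodes not yet connected.
    for a, b in permutations(g["nodes"], 2):
        if (a, b) not in g["edges"]:
            h = copy.deepcopy(g)
            h["edges"][(a, b)] = PHI
            out.append(h)
    # 3. Attribute specialization of every node and every edge.
    for kind, alphabet in (("nodes", node_alphabet), ("edges", edge_alphabet)):
        for key, attr in g[kind].items():
            for new_attr in specialize_attr(attr, alphabet):
                h = copy.deepcopy(g)
                h[kind][key] = new_attr
                out.append(h)
    return out


SML = frozenset("sml")  # the values small, medium, large of Fig. 1
node_alphabet = {"circle": (SML,), "rectangle": (SML, SML)}
edge_alphabet = {"on_top": ()}

trivial = {"nodes": {0: PHI}, "edges": {}}
one_circle = {"nodes": {0: ("circle", (SML,))}, "edges": {}}
print(len(specialize(trivial, node_alphabet, edge_alphabet)))
print(len(specialize(one_circle, node_alphabet, edge_alphabet)))
```

For the trivial prototype the result contains one node addition and one typed variant per node type; for the single circle it contains one node addition and three variants, each excluding one size value.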



4. Application and Discussion

The method has been tested on a character recognition problem, obtained
by selecting about 9000 random digits from the ETL-1 Character Database. We
have partitioned the whole data set into a training set of 230 samples per class, and a
separate test set of 680 samples per class. Each character of the database is
represented by a 63x64 bitmap that we have described in terms of circular arcs by
means of a previously described preprocessing phase. Fig. 2 illustrates the adopted
description scheme; basically, we have defined two node types for representing our
primitives (the circular arcs, here called strokes) and their junctions; the edges
represent the adjacency of a stroke to a junction. Node attributes encode the size of
the strokes (normalized with respect to the size of the whole character), their shape
(ranging from straight line segment to full circle) and their orientation; edge attributes
represent the relative position of a junction with respect to the strokes it connects.

Our learning algorithm has generated in about 27 hours a list of 136 prototypes,
consistent and complete with respect to the training set. While the required time
seems to be quite high, it has to be considered that other first-order symbolic learning
algorithms are generally unable to produce any result on training sets of the
considered size: for instance, Quinlan’s FOIL was not able to run on our training
set due to memory limitations. Moreover, the samples are highly noisy and no effort
has been made to clean the training set, as is usually done when working with
symbolic machine learning methods.




[Figure 2. Panel a): a sample character drawn as strokes and junctions. Panel b): the
corresponding graph, with node labels such as stroke(m,lb,nw), stroke(vl,s,w) and junction,
and edge labels such as connection(v,a). Panels c) and d): the type alphabets. Node types:
T = { stroke, junction }, with k_stroke = 3 and k_junction = 0; the stroke shape takes values
in {s, lb, b, hb, c} and the stroke orientation in {n, nw, w, sw, s, se, e, ne}. Edge types:
T = { connection }, with k_connection = 2; x-projection = {l, v, r} and y-projection = {b, h, a}.]

Fig. 2. An example of a sample of the database. a) Its representation in terms of strokes and
junctions: notice that the junction has been highlighted for the sake of clarity. b) The
corresponding graph (topologically arranged to make clear the matching between the nodes and
the strokes/junctions), according to c)-d) the formal description of node and edge types. Nodes
of the ARG are used for describing both the strokes (type stroke) and the connections among
them (type junction). Nodes associated to strokes have three parameters: the size can be very
short, short, medium, long or very long; the shape can be straight, slightly bent, bent, highly
bent or circular; the orientation can be one of the 8 directions of the compass card. Junctions
have no parameters. The edges of the graphs are used for describing the position of a junction
with respect to a stroke, by means of the projections of the junction and of the stroke on the x
and y axes: the junction can be on the left or on the right of the stroke, or else the stroke is
approximately horizontal; similarly, the junction can be below or above the stroke, or else the
latter is vertical.

Fig. 3. The coverage of the 136 prototypes found on the training set. The sequence of
prototypes within each class matches the order in which they are generated. Notice the
generalization capability of our system: starting from an average of 234.2 samples per class
in the training set, it has found only 13.6 prototypes per class on average, with a compression
ratio greater than 17:1.



The number of prototypes per class generated by our algorithm can be read in
fig. 3, which also shows the coverage of each prototype: most of the classes are
covered at 80% by using only a few prototypes per class. These prototypes
have a high coverage and capture the major invariants of the character shapes inside a
class. The remaining prototypes account for a few characters which, because of noise,
are quite dissimilar from the average samples of their class.

Table 1. The misclassification matrices evaluated on the test set (null values are not printed):
white columns report the results when no form of reject is introduced, while gray ones refer to
the discard of all prototypes covering less than 1% of the training samples. The recognition
rates of the classes are reported as (bold) values on the main diagonal, while the value at row i
and column j (j ≠ i) denotes the percentage of samples of class i erroneously attributed to
class j. The last column reports the percentage of samples our system cannot assign to any
class. Without rejecting, the overall recognition rate is 81.1%. When accepting some reject,
misclassification decreases: e.g., though the recognition rate in the case of the class ‘2’ is
significantly lower, the number of samples erroneously classified as ‘7’ has drastically
decreased (see the column ‘7’).

Table 1 reports the classification results on the test set; gray columns show the
classification performance obtained after removing from the prototype set the
prototypes which covered less than 1% of the training samples. These prototypes are
more likely influenced by noise in the training set than by actual invariants of the
class they represent. This removal causes the rejection of some samples but, as can
be noted, most of the rejected samples were previously misclassified; hence the effect
of this pruning is an overall improvement of the classification reliability.

5. Concluding Remarks

In this paper we have presented a novel method for learning structural descriptions
from examples, based on a formulation of the learning problem in terms of ARG’s.
Our method, like learning methods based on first-order logic, produces general
prototypes that are easy to understand and to manipulate, but it is based on simpler
operations (graph editing and graph matching), leading to a smaller overall
computational cost.

At the moment, we are working on the optimization of the heuristic function in
order to increase the prototype generality. We are also studying more sophisticated
operators for attribute specialization, able to further reduce the learning time. Finally,
preliminary work is being done on a post-processing phase for removing from the
prototypes unnecessary constraints introduced by the greedy nature of the algorithm.



Abstract. In many applications, objects are represented by a collection
of unorganized points that scan the surface of the object. In such cases,
an efficient way of storing this information is of interest. In this paper we
present an arithmetic compression scheme that uses a tree representation
of the data set and allows for better compression rates than general-purpose
methods.



1 Introduction 

Effective compression of images is a major research area, especially for two dimensional
images and video images. Some work has also been done on three dimensional
image compression. A number of these works deal with medical applications
where all information, including that related to inner points, is
relevant. However, there exist cases (for instance, most applications in industrial
design) where the only relevant information is conveyed by the surface delimiting
the object. In such cases, data are usually provided by a scanner that scans
the surface and outputs a collection of points (given as vectors in the Euclidean
space). Previous work has been done in order to build models that allow for a
suitable geometric description of surfaces of this type so that a highly efficient
codification of the image is achieved [HDD+94]. The cost of such algorithms is
always a crucial issue.

In the simpler case of two dimensional images, the contour of a simply connected
shape can be efficiently coded as a string of symbols. For this purpose it is
enough to choose the direction in which the contour is covered (i.e., clockwise or
counter-clockwise). Then, every point has two neighbors (a predecessor and a
successor) and the string representation is simply generated by writing the relative
position between consecutive points. Usually, two dimensional figures are described
as a collection of pixels. Because every pixel has 8 possible neighbors, one byte
is enough to describe the relative position of a pixel with respect to the previous
one. Therefore, if the shape is simply connected, one string contains all the
information needed to reconstruct the shape, except for a global translation that
is not relevant for most purposes. With this method a considerable reduction in
the file size needed to store the information has been reported.
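The contour coding just described is commonly known as a chain code. The sketch below assumes the contour is given as an ordered list of 8-connected pixel coordinates; the numbering of the eight directions and the function names are choices made for this illustration. Each direction code fits in three bits, and therefore comfortably in the byte mentioned above.

```python
# Offsets (dx, dy) of the 8 neighbours, indexed by direction code 0..7.
OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
CODE = {offset: code for code, offset in enumerate(OFFSETS)}


def encode(contour):
    """Replace each step between consecutive pixels by its direction code."""
    return [CODE[(x1 - x0, y1 - y0)] for (x0, y0), (x1, y1) in zip(contour, contour[1:])]


def decode(start, codes):
    """Rebuild the pixel list from a starting pixel and the direction codes."""
    points = [start]
    for code in codes:
        dx, dy = OFFSETS[code]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points


square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]  # a closed unit square
codes = encode(square)
print(codes)
print(decode((5, 5), codes))  # same shape, globally translated
```

Decoding from a different starting pixel reproduces the same shape at another position, which is the global translation the text refers to.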

* Work partially supported by the Spanish CICyT under grant TIC97-0941. 



F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 457 ff., 2000.
© Springer-Verlag Berlin Heidelberg 2000




J.R. Rico-Juan, J. Calera-Rubio, and R.C. Carrasco 



The method described above allows for simple and fast coding and decoding. 
However, in the case of 3D images, some important differences arise: 

1. The object is limited by a surface and the number of possible neighbors in 
a grid increases to 26. 

2. In the usual setting, the scanner precision is much higher than the minimum 
distance between points, so that they are no longer adjacent. 

3. Because a point on a surface may have more than two neighbors, a priority
has to be established in order to define a path. Even so, it may be impossible 
to scan the object surface with a single path that goes only once through 
every point. 

The last fact suggests that a natural way to describe the surface is using a tree 
representation, as we describe in the following section. 



2 Data Representation and Modeling 

Given a sample of 3D vectors $S = \{r_1, r_2, \ldots, r_{|S|}\}$, we may define the fully
connected graph $G$ where the node set is $S$ and the weight of edge $(i, j)$ is
the Euclidean distance between $r_i$ and $r_j$. Then, we may build $T$, the minimum
spanning tree (MST) of $G$, with any of the standard algorithms (see, for
instance, [CLR90]). Due to the geometric properties¹ of the Euclidean space, the
maximal number of neighbors is 12 (12 having probability 0). However, because
in our case the points are distributed on a surface (that is, locally
on a plane), in practice fewer than 6 neighbors are always found and the tree width
of $T$ is at most five.
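As an illustration of this construction, the following sketch builds the MST of the complete Euclidean graph with Prim's algorithm, one of the standard choices, in its simple quadratic form. The function name, the list-of-tuples input and the decision to return a father array are assumptions made for this example.

```python
import math


def euclidean_mst(points, root=0):
    """Prim's algorithm on the complete graph; returns father[i] (None for the root)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection of each node to the tree
    father = [None] * n
    best[root] = 0.0
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], father[v] = d, u
    return father


points = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (1, 1, 0)]
father = euclidean_mst(points)
print(father)
```

The scan over all pairs costs O(|S|^2) distance evaluations; faster variants exist but are not needed to follow the text.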

Every node $n$ in the MST with father $m$ is labelled with a vector giving
the difference $d(n) = r_n - r_m$. For the root node $\rho$, we take $d(\rho) = r_\rho$. In our
approach, we will compress the information using arithmetic coding [WNC87]
of the input. For this purpose we need stochastic models describing the
tree structure and the vector components contained in $T$:

1. In order to model the structure of the tree we will compute the probabilities
$p_k$ that a node expands into a given number $k$ of siblings (i.e. children). This
probability is estimated as the relative number of subtrees in $T$ having $k$ siblings,
under the implicit assumption that it does not depend on the position of
the node.

2. In order to code the components of the vector $d(n)$, which are floating-point
numbers, we use three probability distributions $F_1(d_1)$, $F_2(d_2)$, $F_3(d_3)$, one for
each component. An efficient coding requires that these distributions be
analytically invertible. Therefore, rather than a normal distribution
we use logistic sigmoid functions

$$F_i(x) = \frac{1}{1 + \exp(-\lambda_i (x - \mu_i))} \qquad (1)$$

whose density function is $\lambda_i F_i(x)(1 - F_i(x))$. The parameters $\mu_i$ and $\lambda_i$ are
evaluated from the average and standard deviation $\sigma_i$² of the components
$d_i$ of the nodes in $T$.

¹ This question is related to that of packing spheres with optimal density.
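The reason for preferring the logistic function is that both it and its inverse have closed forms. The sketch below fits the two parameters from a list of component values, using the relation between the scale parameter and the standard deviation (their product equals π/√3) given in the paper's footnote, and then evaluates the distribution, its inverse and its density. The function names and the use of the population standard deviation are assumptions.

```python
import math
import statistics


def fit_logistic(values):
    """Return (mu, lam) with mu the mean and lam * sigma = pi / sqrt(3)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return mu, math.pi / (sigma * math.sqrt(3))


def cdf(x, mu, lam):
    """Logistic distribution function of eq. (1)."""
    return 1.0 / (1.0 + math.exp(-lam * (x - mu)))


def inverse_cdf(u, mu, lam):
    """Closed-form inverse of cdf, valid for 0 < u < 1."""
    return mu + math.log(u / (1.0 - u)) / lam


def pdf(x, mu, lam):
    """Density lam * F(x) * (1 - F(x))."""
    f = cdf(x, mu, lam)
    return lam * f * (1.0 - f)


mu, lam = fit_logistic([-1.0, 0.0, 1.0])
print(mu, lam)
print(inverse_cdf(cdf(0.3, mu, lam), mu, lam))  # recovers 0.3 up to rounding
```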

3 Arithmetic Coding of the Data 

The arithmetic compression is performed by following a preorder traversal of $T$
and coding at each node $n$ the vector $d(n)$ and the number of siblings (children) of
node $n$. For instance, for the tree plotted in the figure, the traversal will produce as
input for the encoder the sequence

d(1) 3 d(2) 0 d(3) 2 d(5) 0 d(6) 1 d(7) 0 d(4) 0

Note that the sequence above consists of 28 input symbols, as every vector $d(n)$
contains three numbers ($d_1(n)$, $d_2(n)$ and $d_3(n)$) and every number is considered
a different symbol by the arithmetic encoder. The output file contains the
parameter set as a header (the parameters $\mu_i$ and $\lambda_i$ together with the
probabilities $p_k$) with a negligible increase of the file size. Then, arithmetic coding is
performed using alternately the four models contained in the header. In order
to avoid rounding errors during decoding (due to the 32-bit arithmetic), the
tails of the distribution $F$ are treated in a special way. Given a certain scanner
precision $\epsilon$, there exists $x_t > 0$ such that

$$F(x_t + \epsilon) - F(x_t - \epsilon) = 2^{-32} \qquad (2)$$

All $x$ such that $|x| > x_t$ are considered outliers. In such cases (which are highly
improbable), the code for the region $|x| > x_t$ is generated, followed by the number
$x$ uncoded.
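The preorder serialization described at the start of this section, and its inverse, can be sketched as follows. The tree is the seven-node example whose sequence is quoted above; the node positions are invented for the illustration, the sign convention (a node's position minus its father's) is assumed, and the arithmetic coder itself is left out. The names `preorder_stream` and `rebuild` are hypothetical.

```python
def preorder_stream(children, label, root):
    """Emit d(n) followed by the number of children of n, in preorder."""
    stream, stack = [], [root]
    while stack:
        n = stack.pop()
        kids = children.get(n, [])
        stream += [label[n], len(kids)]
        stack.extend(reversed(kids))  # so that the leftmost child is visited first
    return stream


def rebuild(stream):
    """Recover the absolute positions, in preorder, from the stream alone."""
    it = iter(stream)
    positions = []
    pending = [None]  # father positions still awaiting a child; None marks the root
    while pending:
        father = pending.pop()
        d, k = next(it), next(it)
        pos = d if father is None else tuple(a + b for a, b in zip(father, d))
        positions.append(pos)
        pending.extend([pos] * k)
    return positions


# Tree of the example: node 1 has children 2, 3, 4; node 3 has 5, 6; node 6 has 7.
children = {1: [2, 3, 4], 3: [5, 6], 6: [7]}
position = {1: (0, 0, 0), 2: (1, 0, 0), 3: (0, 1, 0), 4: (0, 0, 1),
            5: (0, 2, 0), 6: (1, 1, 0), 7: (2, 1, 0)}
father = {c: p for p, kids in children.items() for c in kids}
label = {n: r if n not in father else tuple(a - b for a, b in zip(r, position[father[n]]))
         for n, r in position.items()}

stream = preorder_stream(children, label, root=1)
print(stream[1::2])  # the child counts, in the order they are written
```

Because each node announces how many children follow, the decoder needs no explicit parent pointers: a stack of pending father positions is enough to place every difference vector.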



² The parameter $\lambda_i$ is related to $\sigma_i$ by $\lambda_i \sigma_i = \pi/\sqrt{3}$.

4 Results and Discussion

A collection of five different files with a variable number of data (between 4 and
19 thousand points) were compressed with the previously described method. For
the sake of comparison, the raw data were also compressed using 1) a Lempel-Ziv
compressor (gzip); 2) a Huffman coder (bzip2, which uses a Burrows-Wheeler
transform); and 3) an arithmetic coder based on trigrams described in [Mel91].
The results are presented in Table 1. Compression rates are defined as the quotient
between the compressed file size and the original file size.



dataset #    gzip    3-gram    bzip2    ours    size (bytes)
    1        0.36     0.30     0.29     0.22       108758
    2        0.31     0.29     0.27     0.22       253615
    3        0.38     0.32     0.34     0.21       339859
    4        0.36     0.31     0.34     0.20       449768
    5        0.21     0.16     0.11     0.17       484356

Table 1. Compression rate for 5 different data sets.



As shown in the table, our method compares favorably with the existing
general-purpose methods and allows for compression factors of about 5 (i.e. rates
close to 0.2), which are in general better than those obtained with the other methods.

5 Conclusion

We have implemented an arithmetic coder that compresses files of data containing
points describing the surface of three dimensional objects. The method uses
a simple representation of the surface consisting of a tree of relative positions
between points and models this structure and the components. The time
complexity coincides with that of the standard MST construction algorithms. The
compression rates are comparable or significantly better than those obtained
with standard methods. Further refinements of the data modeling may lead to
improved results.



Abstract. The structural description ansatz often used for representing
and recognizing complex objects leads to the consistent labeling problem
or to some optimization problems on labeled graphs. Although these
problems are NP-complete in general, it is well known that they are easily
solvable if the underlying graph is a tree or even a partial m-tree (i.e.
its treewidth is m). On the other hand, the underlying graphs arising in
image analysis are often lattices or even fully connected. In this paper we
study a special class of consistent labeling problems where the label set
is ordered and the predicates preserve some structure derived from this
ordering. We show that consistent labeling can be solved in polynomial
time in this case even for fully connected graphs. Then we generalize
this result to the “MaxMin” problem on labeled graphs and show how
to solve it if the similarity functions preserve the same structure.



1 Introduction 

Structural description is one of the most general methods for representing and
recognizing complex real world objects and is thus very popular in image analysis.
In particular, attributed or labeled graphs are often used as an effective means
of structural description: a complex object is composed of primitives which
have to fulfil some neighbourhood constraints. The primitives are represented by
labels attached to the vertices of a graph, whereas the constraints can be thought of
as predicates on pairs of labels and are attached to the edges of the graph.
Recognition of objects modelled in this way can be divided into two stages.
First, some features (e.g. local ones) are measured in order to obtain the primitives
located in image fragments corresponding to the graph nodes. Of course we
cannot expect to have unique answers from local measurements. So the answer
will be, for instance, a subset of possible primitives or some similarity function
for each vertex. In the first case the next stage of recognition is equivalent to a
consistent labeling problem, in the second case to some optimization problem
on labeled graphs.

A well known and popular example of this kind of recognition is given by Hidden
Markov Models used in speech recognition, where primitives are phonemes or
phoneme groups. Obviously almost all problems can be solved in linear time for
these models because the underlying graph is a simple chain. In contrast, the
situation in image analysis is much harder: the graphs under consideration are
often lattices or even fully connected. In general, consistent labeling and most
optimization problems on labeled graphs are NP-complete. Therefore all known
algorithms for solving these problems in the general case (i.e. without further
assumptions) scale exponentially with the number of vertices of the graph.
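To make the chain case concrete, here is a sketch of consistent labeling on a simple chain by a standard forward pass with back-pointers; it is a textbook dynamic-programming scheme, not the algorithm for ordered label sets developed in this paper. It assumes each vertex comes with a set of candidate labels (the outcome of the local measurements) and that the predicate on the edge between vertices i and i+1 is supplied as a function `allowed(i, a, b)`; these names are hypothetical. The running time is linear in the number of vertices and quadratic in the number of labels.

```python
def consistent_labeling_on_chain(candidates, allowed):
    """Return one labeling of the chain satisfying every edge predicate, or None.

    candidates[i] is the set of labels admitted at vertex i;
    allowed(i, a, b) tells whether labels (a, b) may sit on edge (i, i + 1).
    """
    if not candidates:
        return []
    # support[i][b]: a label at vertex i - 1 that makes b reachable at vertex i.
    support = [{label: None for label in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for b in candidates[i]:
            for a in support[-1]:
                if allowed(i - 1, a, b):
                    layer[b] = a
                    break
        support.append(layer)
    if not support[-1]:
        return None  # no label of the last vertex is reachable
    labeling = [next(iter(support[-1]))]
    for i in range(len(candidates) - 1, 0, -1):
        labeling.append(support[i][labeling[-1]])  # follow the back-pointers
    return labeling[::-1]


def different(i, a, b):
    """Example predicate: neighbouring vertices must carry different labels."""
    return a != b


print(consistent_labeling_on_chain([{"a"}, {"a", "b"}, {"b", "c"}], different))
print(consistent_labeling_on_chain([{"a"}, {"a"}], different))
```

In the first call the labels are forced one after another along the chain; in the second no consistent labeling exists and the function returns None.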

The aforementioned problems are easily solvable if the underlying graph is
a tree or even a partial m-tree (i.e. its treewidth is m): algorithms of
complexity $n|K|^{m+1}$ are then known, where K is the set of used primitives
and n is the number of vertices of the graph. On the other hand, these
algorithms are not very useful in image analysis: an n x n rectangular lattice
is a partial n-tree!

Despite the importance of the consistent labeling problem, its long history
lacks attempts to investigate how its complexity depends on the predicates
used, i.e.: do there exist classes of predicates for which the aforementioned
problems are solvable in polynomial time even for fully connected graphs?

In this paper we study a special class of consistent labeling problems where
the label set K is ordered and the predicates preserve some structure derived
from this ordering². We show that consistent labeling can be solved in
polynomial time in this case even for fully connected graphs. Then we
generalize this result to the “MaxMin” problem on labeled graphs and show how
to solve it if the similarity functions preserve the same structure.



2 Consistent Labeling and Related Optimization 
Problems 

In this section we introduce a formal notion of the consistent labeling problem
and some related optimization problems arising in image analysis, in order to
keep this paper complete and self-contained.

Let $G = (R, E)$ be an undirected graph with vertices $R$ and edges $E$. Let
$K$ be a finite set of labels (often called symbols in the context of
structural recognition). A labeling or symbol field is a mapping
$y: R \to K$ assigning a symbol $y(r)$ to each vertex $r \in R$. The set of all
symbol fields is denoted by $\mathcal{A}(R, K)$.

Neighbourhood constraints are represented by predicates
$\chi_{ij}: K \times K \to \{0, 1\}$ attached to the edges $(r_i, r_j) \in E$ of
the graph. These predicates define the allowable pairs of symbols on the edges:
only symbol pairs $(k_1, k_2)$ with $\chi_{ij}(k_1, k_2) = 1$ are allowed on the
edge $(r_i, r_j) \in E$. The field of all predicates $\chi_{ij}(\cdot,\cdot)$
attached to the edges $(r_i, r_j) \in E$ of the graph $G$ is denoted by $\chi$
and called a local conjunctive predicate (LCP) on $G$.

¹ From our point of view the stochastic relaxation labeling introduced by
Hummel, Rosenfeld and Zucker does not represent a realistic optimization
problem, because they redefine consistency by some particular kind of local
stability after the transition to real-valued similarity functions.

² First results on a narrower class were reported in an earlier paper.

A symbol field $y \in \mathcal{A}(R, K)$ is a solution of $\chi$ if

$$\chi_{ij}[y(r_i), y(r_j)] = 1 \tag{1}$$

for each edge $(r_i, r_j) \in E$. It is of course very easy to verify whether
$y$ is a solution of $\chi$ by checking (1) for each edge of $G$. In contrast,
the consistent labeling problem is much harder: we have to decide whether a
given $\chi$ has solutions at all. A typical situation in image analysis is
that, given $G$ and $\chi$ describing the model, local measurements “narrow”
the set of possible symbols for each vertex: we obtain a subset $K_i \subset K$
for each $r_i \in R$ and have to decide whether $\chi$ has solutions $y$ with
$y(r_i) \in K_i$, $\forall r_i \in R$.
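The definitions above can be made concrete with a small sketch. It is not taken from the paper: the data layout (a dictionary of allowed label pairs per edge), the function names and the toy triangle are all assumptions chosen for illustration. Verifying condition (1) is linear in the number of edges, whereas the naive search below enumerates every symbol field and is exponential in the number of vertices.

```python
from itertools import product

def is_solution(labeling, predicates):
    """Condition (1): every edge predicate accepts the assigned pair of labels."""
    return all((labeling[i], labeling[j]) in allowed
               for (i, j), allowed in predicates.items())

def brute_force_solutions(candidates, predicates):
    """Enumerate every solution y with y(r_i) in K_i; exponential in |R|."""
    vertices = sorted(candidates)
    found = []
    for labels in product(*(candidates[v] for v in vertices)):
        labeling = dict(zip(vertices, labels))
        if is_solution(labeling, predicates):
            found.append(labeling)
    return found

# Hypothetical model: a triangle whose neighbouring labels differ by at most 1.
K = range(4)
near = {(a, b) for a in K for b in K if abs(a - b) <= 1}
predicates = {(0, 1): near, (1, 2): near, (0, 2): near}
candidates = {0: [1], 1: [0, 1, 2, 3], 2: [2, 3]}   # the subsets K_i
solutions = brute_force_solutions(candidates, predicates)
print(solutions)
```

With these candidate sets only two symbol fields survive, both assigning label 2 to the third vertex.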

Often we are in a “weaker” situation where local measurements give similarity
values for symbols or symbol pairs, expressing fuzzy subsets rather than sharp
subsets. In this case the predicates $\chi_{ij}$ are replaced by real-valued
functions $f_{ij}: K \times K \to \mathbb{R}$ and some optimization problem is
to be solved. Often it is the MaxMin problem:

$$\max_{y \in \mathcal{A}(R,K)} \; \min_{(r_i,r_j) \in E} f_{ij}[y(r_i), y(r_j)] , \tag{2}$$

i.e. find the symbol field for which the smallest similarity on any edge is as
large as possible. Another typical problem is to find the symbol field with the
highest sum of similarities:

$$\max_{y \in \mathcal{A}(R,K)} \; \sum_{(r_i,r_j) \in E} f_{ij}[y(r_i), y(r_j)] . \tag{3}$$
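To illustrate that (2) and (3) are genuinely different criteria, the following sketch evaluates both by exhaustive search on a tiny graph. The similarity tables, the function names and the use of brute force are illustrative assumptions, not part of the paper.

```python
from itertools import product

def min_quality(labeling, f):
    """Objective of (2): the smallest similarity over all edges."""
    return min(f_ij[labeling[i]][labeling[j]] for (i, j), f_ij in f.items())

def sum_quality(labeling, f):
    """Objective of (3): the sum of similarities over all edges."""
    return sum(f_ij[labeling[i]][labeling[j]] for (i, j), f_ij in f.items())

def best_labeling(n_vertices, n_labels, f, quality):
    """Exhaustive search over all symbol fields; only feasible for tiny graphs."""
    fields = (dict(enumerate(labels))
              for labels in product(range(n_labels), repeat=n_vertices))
    return max(fields, key=lambda y: quality(y, f))

# Hypothetical similarities on a triangle with two labels per vertex.
f = {
    (0, 1): [[0.9, 0.0], [0.0, 0.5]],
    (1, 2): [[0.9, 0.0], [0.0, 0.5]],
    (0, 2): [[0.2, 0.0], [0.0, 0.5]],
}
best_maxmin = best_labeling(3, 2, f, min_quality)
best_sum = best_labeling(3, 2, f, sum_quality)
print(best_maxmin, best_sum)
```

Here the MaxMin criterion prefers the uniformly moderate labeling, while the sum criterion accepts one weak edge in exchange for two strong ones.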

3 Closed Predicates on Ordered Sets 

In this section we introduce the class of closed predicates which then will be used 
to form a class of consistent labeling problems. We split the definition of these 
predicates into two parts: First, we endow the symbol set K with an additional 
structure. Closed predicates are then those which preserve this structure. 

Suppose the set $K$ of symbols is ordered. Let $\mathcal{U}$ be the system of
all intervals of $K$, i.e. subsets of the type
$\{k \in K \mid k_1 \le k \le k_2\}$. Then $\mathcal{U}$ meets the definition
of a hull system as used in universal algebra, because

1. $K$ is an interval, i.e. $K \in \mathcal{U}$.

2. $\mathcal{U}$ is closed under intersections, i.e. if $U, U' \in \mathcal{U}$
then $U \cap U' \in \mathcal{U}$.

Usually subsets which are elements of a hull system are called hulls or closed
subsets. The closure of an arbitrary subset $A$ is defined as the smallest
closed subset containing $A$:

$$\mathrm{Cl}(A) = \bigcap \{U \in \mathcal{U} \mid A \subset U\} .$$

Examples of other hull systems are the closed sets of a topological space or
the convex sets of a vector space. Our special hull system has another striking
property:

3. Whenever $\mathcal{B} \subset \mathcal{U}$ is a set of intervals with
pairwise nonempty intersections, i.e. $U \cap U' \neq \emptyset$,
$\forall U, U' \in \mathcal{B}$, then their intersection is not empty:
$\bigcap_{U \in \mathcal{B}} U \neq \emptyset$.

The structure of this hull system is inherited by every subset of $K$: let
$K_1 \subset K$ be an arbitrary subset, then

$$\mathcal{U}_{K_1} = \{U' \subset K_1 \mid \exists U \in \mathcal{U}: U' = U \cap K_1\}$$

is a hull system fulfilling 1-3.
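As a minimal sketch of these notions, assuming integer labels with their natural order (the function names are hypothetical), the closure of a subset is the run of universe elements between its minimum and maximum. The last lines check property 3 on one example family of intervals.

```python
def closure(subset, universe):
    """Cl(A): the smallest hull (interval of `universe`) containing `subset`."""
    if not subset:
        return frozenset()
    lo, hi = min(subset), max(subset)
    return frozenset(k for k in universe if lo <= k <= hi)

def is_hull(subset, universe):
    """A subset is a hull exactly if it equals its own closure."""
    return frozenset(subset) == closure(subset, universe)

K = range(8)
print(sorted(closure({2, 5}, K)))                       # interval spanned by 2 and 5
print(is_hull({2, 5}, K), is_hull({2, 5}, {0, 2, 5, 7}))

# Property 3 on an example: pairwise intersecting intervals share an element.
B = [set(range(0, 4)), set(range(2, 6)), set(range(3, 5))]
pairwise = all(u & w for u in B for w in B)
common = set.intersection(*B)
print(pairwise, common)
```

Note that {2, 5} is not a hull of the full set K but is a hull of the subset {0, 2, 5, 7}, which is the inherited hull system described above.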

Let $\chi: K \times K \to \{0, 1\}$ be a predicate on $K \times K$. We
interpret $\chi$ equivalently as a subset $\chi \subset K \times K$:
$\{(k_1, k_2) \mid \chi(k_1, k_2) = 1\}$. A third equivalent meaning is to
interpret $\chi$ as a relation on $K$. Therefore the set of all predicates on
$K \times K$ can be endowed with the operations $\chi \cap \chi'$ and
$\chi \circ \chi'$, where

$$(\chi \circ \chi')(k_1, k_2) = 1 \iff \exists k_3: \chi(k_1, k_3) = 1 \text{ and } \chi'(k_3, k_2) = 1$$

denotes multiplication of relations.

Now we are ready to define the class of closed predicates, i.e. predicates
preserving the structure of the hull system $\mathcal{U}$. Let $\pi_1$ and
$\pi_2$ denote the projections from $K \times K$ onto the first and second
component, respectively: if $M \subset K \times K$, then

$$\pi_1(M) = \{k_1 \mid \exists k_2: (k_1, k_2) \in M\} \quad\text{and}\quad \pi_2(M) = \{k_2 \mid \exists k_1: (k_1, k_2) \in M\} .$$

Let $\chi$ be a predicate on $K \times K$ considered as a subset, with
$K_1 = \pi_1(\chi)$ and $K_2 = \pi_2(\chi)$. Then $\chi$ induces the mappings
$F_\chi: \mathcal{P}(K_1) \to \mathcal{P}(K_2)$ and
$F_\chi^{-1}: \mathcal{P}(K_2) \to \mathcal{P}(K_1)$, where $\mathcal{P}(K)$
denotes the power set of $K$. They map subsets of $K_1 \subset K$ onto subsets
of $K_2 \subset K$ and vice versa. These mappings are defined as follows:

$$F_\chi(V_1) = \pi_2[(V_1 \times K_2) \cap \chi] \quad\text{and}\quad F_\chi^{-1}(V_2) = \pi_1[(K_1 \times V_2) \cap \chi] ,$$

where $V_1 \subset K_1$ and $V_2 \subset K_2$ (see Fig. 1). Please note that
$F_\chi^{-1}$ is the inverse of $F_\chi$ only if $F_\chi$ is invertible.
Nevertheless $V_1 \subset F_\chi^{-1}(F_\chi(V_1))$ and
$V_2 \subset F_\chi(F_\chi^{-1}(V_2))$ hold for every $\chi$ and every
$V_1 \subset K_1$ and $V_2 \subset K_2$.

Definition 1. Let $\chi$ be a predicate on $K \times K$, where $K$ is ordered
and $K_1 = \pi_1(\chi)$, $K_2 = \pi_2(\chi)$. Then $\chi$ is called closed if
$F_\chi$ and $F_\chi^{-1}$ map hulls of $\mathcal{U}_{K_1}$ onto hulls of
$\mathcal{U}_{K_2}$ and vice versa.

Example 1. Let $|K| = 4$. The predicate depicted on the left in Fig. 2 is
closed, whereas that on the right is not.
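Definition 1 can be checked mechanically for small label sets. The sketch below represents a predicate as a set of allowed pairs and tests every hull of the two projections; the representation, the function names and the two example predicates are assumptions for illustration and are not the predicates shown in Fig. 2.

```python
def image(chi, v1):
    """F_chi(V1): second components of the pairs of chi whose first lies in V1."""
    return {k2 for (k1, k2) in chi if k1 in v1}

def preimage(chi, v2):
    """F_chi^{-1}(V2): first components of the pairs of chi whose second lies in V2."""
    return {k1 for (k1, k2) in chi if k2 in v2}

def hulls(universe):
    """All nonempty hulls of the inherited system: runs of consecutive elements."""
    s = sorted(universe)
    return [set(s[a:b + 1]) for a in range(len(s)) for b in range(a, len(s))]

def is_hull(subset, universe):
    if not subset:
        return True
    return set(subset) == {k for k in universe if min(subset) <= k <= max(subset)}

def is_closed(chi):
    """Definition 1, tested by brute force over all hulls of K1 and K2."""
    k1 = {a for a, _ in chi}
    k2 = {b for _, b in chi}
    return (all(is_hull(image(chi, u), k2) for u in hulls(k1)) and
            all(is_hull(preimage(chi, u), k1) for u in hulls(k2)))

K = range(4)
band = {(a, b) for a in K for b in K if abs(a - b) <= 1}          # labels differ by <= 1
differ = {(a, b) for a in range(3) for b in range(3) if a != b}   # labels must differ
print(is_closed(band), is_closed(differ))
```

The band predicate maps intervals to intervals, while the inequality predicate maps the single label 1 to the non-interval {0, 2} and is therefore not closed.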



Lemma 1. Let $K$ be ordered and $\chi$, $\chi'$ be closed predicates on
$K \times K$. Then $\chi \cap \chi'$ and $\chi \circ \chi'$ are closed
predicates.

Proof. In order to simplify subsequent steps of the proof we remark that
intersections of closed predicates with product subsets are again closed
predicates: let $\chi$ be a closed predicate on $K \times K$ and
$K_1, K_2 \subset K$. Then the predicate $\chi \cap (K_1 \times K_2)$ is closed.
Fig. 1. Illustration of the mapping $F_\chi$: The intersections of the dotted
lines with the components depict the borders of $K_1 = \pi_1(\chi)$ and
$K_2 = \pi_2(\chi)$. The vertical grey bar depicts $V_1 \times K_2$. Its
intersection with $\chi$ projected on $K_2$ gives $F_\chi(V_1)$.





Fig. 2. Two predicates on K x K depicted as subsets. The left is closed, the right is 
not. 



Let $\chi'$, $\chi''$ be closed predicates on $K \times K$ and
$\chi = \chi' \cap \chi''$. Without loss of generality we assume that
$\pi_1(\chi) = \pi_2(\chi) = K$. Let $F$, $F'$ and $F''$ denote the associated
mappings from $\mathcal{P}(K)$ into $\mathcal{P}(K)$.

Let us assume now that $\chi$ is not closed. Then there must be an interval
$A_1 \subset K$ such that $A_2 = F(A_1)$ is not closed. Let $k$ be an element of
$\mathrm{Cl}(A_2)$ not contained in $A_2$, i.e.
$(A_1 \times \{k\}) \cap \chi = \emptyset$. We consider the sets

$$B_1 = F'^{-1}(k) = \pi_1[(K \times \{k\}) \cap \chi'] \quad\text{and}\quad C_1 = F''^{-1}(k) = \pi_1[(K \times \{k\}) \cap \chi''] .$$

According to our assumptions it then follows that

- $B_1$ and $C_1$ are nonempty and $B_1 \cap C_1 \neq \emptyset$.

- $\mathrm{Cl}(A_2) \subset F'(A_1)$ holds and therefore
  $B_1 \cap A_1 \neq \emptyset$. Similarly $C_1 \cap A_1 \neq \emptyset$.

Because $A_1$, $B_1$ and $C_1$ are closed (i.e. intervals) and intersect
pairwise, property 3 gives $A_1 \cap B_1 \cap C_1 \neq \emptyset$, so
$k \in A_2$ follows, a contradiction. Hence $\chi = \chi' \cap \chi''$ must be
closed.

Let us prove now that $\chi = \chi' \circ \chi''$ is closed whenever $\chi'$
and $\chi''$ are closed. Without loss of generality we assume that
$\pi_2(\chi') = \pi_1(\chi'')$. But in this case

$$F(A) = F''(F'(A)) \quad\text{and}\quad F^{-1}(A) = F'^{-1}(F''^{-1}(A))$$

and therefore $\chi$ is closed. □
4 Consistent Labeling with Closed Predicates



In this section we consider LCP with closed local predicates. We will show that 
consistent labeling for such LCP can be solved in polynomial time without any 
restrictions on the underlying graph. We present a parallel algorithm that solves 
the problem and generates a weak description of the solution set. A sequential 
version of this algorithm is presented as well. 

Theorem 1. Let $G = (R, E)$ be a fully connected graph with $n$ vertices and
$K$ be an ordered symbol set. The consistent labeling problem for an LCP $\chi$
on $G$ can be solved in polynomial time if $\chi$ is closed, i.e. all local
predicates of $\chi$ are closed.

Proof. Consider the following iterative algorithm³:

$$\chi^{(0)}_{ij} = \chi_{ij}, \qquad \chi^{(t+1)}_{ij} = \chi^{(t)}_{ij} \cap \Big[ \bigcap_{k} \chi^{(t)}_{ik} \circ \chi^{(t)}_{kj} \Big] . \tag{4}$$

The series of LCPs $\chi^{(t)}$ reaches a fixpoint $\chi^*$ after at most
$n^2|K|^2/2$ iterations. This is because there are $n^2/2$ predicates, each
with $|K|^2$ binary entries, and in each iteration before the fixpoint at least
one of these $n^2|K|^2/2$ entries changes from 1 to 0 (they never change from
0 to 1).
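A minimal sketch of iteration (4), assuming predicates are stored as boolean NumPy matrices for every ordered vertex pair; the function names and the toy data are hypothetical. The relation product is computed as a thresholded integer matrix product, and the intersection over k skips the two end points of the edge.

```python
import numpy as np

def compose(a, b):
    """Relation product: (a o b)[k1, k2] is True iff a[k1, k3] and b[k3, k2] for some k3."""
    return (a.astype(int) @ b.astype(int)) > 0

def fixpoint(chi, n):
    """Iterate update (4) on a fully connected graph with vertices 0..n-1.

    chi maps every ordered pair (i, j), i != j, to a boolean |K| x |K| array,
    with chi[j, i] equal to the transpose of chi[i, j].
    """
    chi = {edge: m.copy() for edge, m in chi.items()}
    while True:
        new = {}
        for (i, j), m in chi.items():
            updated = m.copy()
            for k in range(n):
                if k not in (i, j):
                    updated &= compose(chi[i, k], chi[k, j])
            new[i, j] = updated
        if all(np.array_equal(new[e], chi[e]) for e in chi):
            return chi
        chi = new

def support(m):
    """pi_1 of a predicate: labels of the first vertex that are still allowed."""
    return {int(k) for k in np.flatnonzero(m.any(axis=1))}

# Hypothetical example: labels 0..3, neighbouring labels differ by at most 1,
# and local measurements restrict vertex i to the candidate set K_i.
labels = np.arange(4)
near = np.abs(labels[:, None] - labels[None, :]) <= 1
candidates = {0: [1], 1: [0, 1, 2, 3], 2: [2, 3]}
mask = {i: np.isin(labels, ks) for i, ks in candidates.items()}
chi = {(i, j): near & mask[i][:, None] & mask[j][None, :]
       for i in range(3) for j in range(3) if i != j}

star = fixpoint(chi, 3)
K_star = {i: support(star[i, (i + 1) % 3]) for i in range(3)}
print(K_star)   # {0: {1}, 1: {1, 2}, 2: {2}}
```

In this example the surviving label sets are exactly the labels that occur in some solution of the toy problem, which is what the remainder of the proof establishes for closed predicates.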

The solution set of $\chi$ is nonempty if every $\chi^*_{ij}$ is nonempty in
the fixpoint. If at least one $\chi^*_{ij}$ is empty, then $\chi$ has no
solutions. This is true because each solution of $\chi^{(t)}$ is also a
solution of $\chi^{(t+1)}$. Hence $\chi$ has solutions only if $\chi^*$ has
solutions.

Let us consider the situation where all $\chi^*_{ij}$ are nonempty. Then all
sets $\pi_1(\chi^*_{ij})$ are nonempty. Furthermore, these sets coincide for
fixed $i$: suppose there is a pair $j, k$ for which
$\pi_1(\chi^*_{ij}) \neq \pi_1(\chi^*_{ik})$. Then either

$$\chi^*_{ij} \cap [\chi^*_{ik} \circ \chi^*_{kj}] \neq \chi^*_{ij} \quad\text{or}\quad \chi^*_{ik} \cap [\chi^*_{ij} \circ \chi^*_{jk}] \neq \chi^*_{ik} .$$

But this contradicts (4), which is an equality for the fixpoint. Hence these
sets coincide for each $i$ and we denote them by
$K^*_i = \pi_1(\chi^*_{ij})$. So far we have shown that if $\chi$ and hence
$\chi^*$ have any solution $y \in \mathcal{A}(R, K)$, then $y(r_i) \in K^*_i$
holds for all $i$.

In order to prove the existence of solutions we will show that each solution
of $\chi^*$ on a subgraph of $G$ can be extended to a solution on the whole
graph $G$: whenever $G_1 = (R_1, E_1)$ is an induced subgraph of $G$ and
$y \in \mathcal{A}(R_1, K)$ is a solution of $\chi^*$ on $G_1$, then $y$ can be
extended to a solution of $\chi^*$ on $G$. It suffices to prove this for the
case $R_1 = R \setminus r_1$, i.e. $G_1$ is obtained from $G$ by deleting the
vertex $r_1$. Let $y \in \mathcal{A}(R_1, K)$ be a solution of $\chi^*$ on
$G_1$, i.e.

$$y(r_i) \in K^*_i, \ \forall r_i \in R_1 \quad\text{and}\quad \chi^*_{ij}[y(r_i), y(r_j)] = 1, \ \forall r_i, r_j \in R_1 . \tag{5}$$

Let $U_i \subset K^*_1$ be the sets

$$U_i = \pi_1\big[(K^*_1 \times y(r_i)) \cap \chi^*_{1i}\big] .$$

By Lemma 1 every $\chi^{(t)}$ and therefore $\chi^*$ is closed. Hence the sets
$U_i$ are hulls of $\mathcal{U}_{K^*_1}$. Let us assume that $y$ cannot be
extended to a solution on $G$; then

$$\bigcap_{i \neq 1} U_i = \emptyset .$$

Hence there must be a pair of vertices $i, j \neq 1$ for which
$U_i \cap U_j = \emptyset$ holds. But then

$$\chi^*_{ij} \cap [\chi^*_{i1} \circ \chi^*_{1j}] \neq \chi^*_{ij} ,$$

which contradicts the fixpoint equality. Thus $y$ can be extended to a solution
on $G$. □

³ See Remark 1 for a more lucid explanation of the algorithm.



Remark 1. In order to give a more lucid interpretation of (4) we start with a
closer look at the expression

$$\chi_{ij} \cap (\chi_{ik} \circ \chi_{kj}) .$$

It represents a new predicate on the edge $(r_i, r_j)$ with the following
property: a pair of symbols $(k_1, k_2)$ is allowed by this predicate only if

1. It is allowed by $\chi_{ij}$.

2. There is at least one $k_3$ such that $(k_1, k_2, k_3)$ is a solution on the
triangle $(r_i, r_j, r_k)$.

Hence (4) can be interpreted in the following way: for the edge $(r_i, r_j)$
and the current predicate $\chi^{(t)}_{ij}$ we check whether each pair of
symbols $(k_1, k_2)$ allowed by $\chi^{(t)}_{ij}$ can be extended to a solution
on each triangle $(r_i, r_j, r_k)$ containing the edge $(r_i, r_j)$. If not,
this symbol pair is not allowed by $\chi^{(t+1)}_{ij}$. Therefore in the
fixpoint each symbol pair allowed on an edge can be extended to a solution on
each triangle.

At first glance it may seem that this condition is sufficient for the existence
of solutions in the general case (i.e. if $K$ is not necessarily ordered and
the predicates are not necessarily closed). Therefore we give here a simple
counterexample. Suppose we try to color a fully connected graph with four
vertices (a tetrahedron) with three colors. This is obviously impossible. But
the initial predicates (the colors of two vertices connected by an edge must
be different) remain nonempty and stable under (4).
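The counterexample can be confirmed numerically with a short sketch (the variable names are illustrative). Because every edge of the tetrahedron carries the same symmetric "colors differ" predicate, it is enough to check that this predicate is unchanged by one triangle update, and then to verify by exhaustive search that no proper coloring exists.

```python
import numpy as np
from itertools import product

colours = 3
differ = ~np.eye(colours, dtype=bool)                        # chi_ij: colours must differ
composed = (differ.astype(int) @ differ.astype(int)) > 0    # chi_ik o chi_kj
stable = np.array_equal(differ & composed, differ)           # one update changes nothing

edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
proper = [c for c in product(range(colours), repeat=4)
          if all(c[i] != c[j] for i, j in edges)]
print(stable, len(proper))   # True 0
```

The predicates stay nonempty and stable although the problem has no solution, so triangle consistency alone does not decide solvability when the predicates are not closed.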



Remark 2. The proof of Theorem 1 shows in particular that algorithm (4) gives
more than only a “yes” or “no” answer to the question whether $\chi$ has
solutions: the sets $K^*_i$ describe the solution set in some weak sense. For
each $k \in K^*_i$ there is at least one solution $y$ with $y(r_i) = k$. More
generally, each solution on a subgraph $G' \subset G$ fulfilling (5) can be
extended to a solution on $G$.

The same result can be obtained using the following sequential algorithm.
This algorithm executes $n - 2$ steps ($n = |R|$). At each step a vertex is
removed from the current graph $G_{i-1}$ and the current LCP $\chi^{(i-1)}$ is
replaced by a modified LCP $\chi^{(i)}$ on the remaining graph $G_i$. The
algorithm performs the following operations in order to execute the $i$-th
step:

$$\chi^{(i)}_{jk} = \chi^{(i-1)}_{jk} \cap \big[\chi^{(i-1)}_{ji} \circ \chi^{(i-1)}_{ik}\big] . \tag{6}$$

The graph $G_{n-2}$ obtained after executing $n - 2$ steps has two nodes
$(r_{n-1}, r_n)$ and one predicate $\chi^{(n-2)}_{n-1,n}$. The LCP $\chi$ has
solutions on $G$ if and only if $\chi^{(n-2)}_{n-1,n}$ is nonempty. In this
case we can choose any solution of $\chi^{(n-2)}_{n-1,n}$ on $G_{n-2}$ and
extend it step by step to a solution of $\chi$ on $G$. It is easy to prove the
existence of such an extension by slightly modifying the proof of Theorem 1.
The complexity of the algorithm is less than $n^3|K|^3$.
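The sequential variant can be sketched as a forward elimination followed by a backward extension. The code below is an illustration under assumptions (zero-based vertex numbering, boolean matrices per ordered pair, hypothetical function names), not the authors' implementation. According to the paper's result, the backward step cannot fail when all predicates are closed; for other predicates the function may return None even though a solution exists.

```python
import numpy as np

def compose(a, b):
    """Relation product of two boolean matrices."""
    return (a.astype(int) @ b.astype(int)) > 0

def solve_sequential(chi, n):
    """Eliminate vertices 0..n-3, then extend a pair on the last edge backwards.

    chi maps every ordered pair (i, j), i != j, to a boolean array with
    chi[j, i] == chi[i, j].T. Returns a labeling (dict) or None.
    """
    chi = {edge: m.copy() for edge, m in chi.items()}
    for i in range(n - 2):                       # forward elimination of vertex i
        for j in range(i + 1, n):
            for k in range(i + 1, n):
                if j != k:
                    chi[j, k] &= compose(chi[j, i], chi[i, k])
    last = np.argwhere(chi[n - 2, n - 1])
    if len(last) == 0:
        return None
    y = {n - 2: int(last[0][0]), n - 1: int(last[0][1])}
    for i in range(n - 3, -1, -1):               # backward extension to vertex i
        fits = [k for k in range(chi[i, i + 1].shape[0])
                if all(chi[i, j][k, y[j]] for j in range(i + 1, n))]
        if not fits:
            return None
        y[i] = fits[0]
    return y

# Same hypothetical toy problem as before: labels differ by at most 1.
labels = np.arange(4)
near = np.abs(labels[:, None] - labels[None, :]) <= 1
candidates = {0: [1], 1: [0, 1, 2, 3], 2: [2, 3]}
mask = {i: np.isin(labels, ks) for i, ks in candidates.items()}
chi = {(i, j): near & mask[i][:, None] & mask[j][None, :]
       for i in range(3) for j in range(3) if i != j}
labeling = solve_sequential(chi, 3)
print(labeling)
```

Each eliminated vertex keeps the predicates it had at the moment of elimination, which is why the backward pass can read them directly from the same dictionary.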



5 The MaxMin Problem for Closed Similarity Functions 



The consistent labeling problem considered so far assumes that symbol pairs
on neighbouring vertices have to fulfil some binary constraints. There are many
applications where symbol pairs are “ranked” by real rather than boolean
numbers and the local predicates are replaced by real-valued functions
$f_{ij}: K \times K \to \mathbb{R}$ associated with the edges $(r_i, r_j)$ of
$G$. The MaxMin problem is then

$$\max_{y \in \mathcal{A}(R,K)} \; \min_{(i,j)} f_{ij}[y(r_i), y(r_j)] . \tag{7}$$

We will show how to generalize the results obtained for consistent labeling in
order to solve this problem. This is possible if all functions fulfil a
closedness property.

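Let $B_\varepsilon f$ be the binarization of a function $f$ with threshold
$\varepsilon$:

$$(B_\varepsilon f)(k_1, k_2) = \begin{cases} 1 & \text{if } f(k_1, k_2) > \varepsilon \\ 0 & \text{otherwise.} \end{cases}$$

By binarizing the real-valued functions $f_{ij}$ we obtain predicates on
$K \times K$. Let us assume that $K$ is ordered and that the binarizations of
all $f_{ij}$ are closed predicates for each threshold $\varepsilon$. Then the
MaxMin problem can be solved as follows:

1. Choose a threshold $\varepsilon$ and binarize all $f_{ij}$:
$\chi_{ij} = B_\varepsilon f_{ij}$.

2. Solve the consistent labeling problem for $\chi$.

3. If $\chi$ has solutions, then increase $\varepsilon$, else decrease
$\varepsilon$. Go to 1.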

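This algorithm is not very “nice” and the question arises whether it is possible
to “commute” binarization and consistent labeling. This is indeed possible. In
order to verify this we use the following operations on functions
$f: K \times K \to \mathbb{R}$:

$$[f \oplus g](k_1, k_2) = \min[f(k_1, k_2),\, g(k_1, k_2)]$$

$$[f \otimes g](k_1, k_2) = \max_{k} \min[f(k_1, k),\, g(k, k_2)] .$$

It is easy to prove that

$$\begin{array}{ccc}
f,\ g & \xrightarrow{\ \oplus\ } & f \oplus g \\
\downarrow B_\varepsilon & & \downarrow B_\varepsilon \\
B_\varepsilon f,\ B_\varepsilon g & \xrightarrow{\ \cap\ } & (B_\varepsilon f) \cap (B_\varepsilon g) = B_\varepsilon (f \oplus g)
\end{array}$$

and

$$\begin{array}{ccc}
f,\ g & \xrightarrow{\ \otimes\ } & f \otimes g \\
\downarrow B_\varepsilon & & \downarrow B_\varepsilon \\
B_\varepsilon f,\ B_\varepsilon g & \xrightarrow{\ \circ\ } & (B_\varepsilon f) \circ (B_\varepsilon g) = B_\varepsilon (f \otimes g)
\end{array}$$

are commutative diagrams. This leads to the following iterative algorithm
solving (7):

$$f^{(0)}_{ij} = f_{ij}, \qquad f^{(t+1)}_{ij} = f^{(t)}_{ij} \oplus \Big[ \bigoplus_{k} f^{(t)}_{ik} \otimes f^{(t)}_{kj} \Big] . \tag{8}$$

It is clear that this algorithm reaches a fixpoint:
$f^{(t+1)}_{ij} = f^{(t)}_{ij} = f^*_{ij}$, $\forall (i, j)$. It follows that

$$\max_{k_i, k_j} [f^*_{ij}(k_i, k_j)] = c \quad \forall (i, j)$$

holds for the fixpoint $f^*$, where $c$ is the optimum of (7). Choosing the
sets

$$K^*_i = \{k \mid \max_{k'} f^*_{ij}(k, k') = c\}$$

we obtain the solution of (7) as described in the proof of Theorem 1.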
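The following sketch implements the two operations and iteration (8) with NumPy and compares the fixpoint maxima with a brute-force optimum of (7). The similarity functions are a hypothetical construction (a band-shaped pairwise term combined with per-vertex scores) whose binarizations are band predicates restricted to product sets; all names and numbers are assumptions for illustration.

```python
import numpy as np
from itertools import product

def oplus(f, g):
    """[f (+) g](k1, k2) = min(f(k1, k2), g(k1, k2))."""
    return np.minimum(f, g)

def otimes(f, g):
    """[f (x) g](k1, k2) = max over k of min(f(k1, k), g(k, k2))."""
    return np.max(np.minimum(f[:, :, None], g[None, :, :]), axis=1)

def maxmin_fixpoint(f, n):
    """Iterate update (8) until no entry changes."""
    f = {e: m.copy() for e, m in f.items()}
    while True:
        new = {}
        for (i, j), m in f.items():
            updated = m
            for k in range(n):
                if k not in (i, j):
                    updated = oplus(updated, otimes(f[i, k], f[k, j]))
            new[i, j] = updated
        if all(np.array_equal(new[e], f[e]) for e in f):
            return f
        f = new

# Hypothetical data: u[i][a] scores label a at vertex i, s scores |a - b|.
u = np.array([[0.9, 0.4, 0.1], [0.2, 0.8, 0.3], [0.1, 0.5, 0.7]])
s = np.array([1.0, 0.7, 0.4])[np.abs(np.subtract.outer(np.arange(3), np.arange(3)))]
f = {(i, j): np.minimum(np.minimum(u[i][:, None], u[j][None, :]), s)
     for i in range(3) for j in range(3) if i != j}

star = maxmin_fixpoint(f, 3)
optimum = {e: float(m.max()) for e, m in star.items()}
brute = max(min(f[i, j][y[i], y[j]] for (i, j) in f)
            for y in product(range(3), repeat=3))
print(optimum, float(brute))
```

In this example every fixpoint function attains the same maximum, and that maximum coincides with the exhaustively computed optimum. Since both operations only select existing values, no rounding error is introduced.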

Remark 3. As in the case of consistent labeling, we can solve (7) equivalently
by a sequential algorithm. Both algorithms coincide up to substituting (6) by

$$f^{(i)}_{jk} = f^{(i-1)}_{jk} \oplus \big[ f^{(i-1)}_{ji} \otimes f^{(i-1)}_{ik} \big] .$$

Again the complexity is less than $n^3|K|^3$.



6 Conclusion 

We have shown that consistent labeling can be solved in polynomial time on
fully connected graphs if the symbol set is ordered and the local predicates are
closed, i.e. preserve a structure derived from the ordering. These results were
generalized in order to solve an important optimization problem on labeled
graphs, the MaxMin problem. It is also solvable in polynomial time if its
similarity functions preserve the aforementioned structure. Whether assumptions
of this kind are sufficient to solve another prominent optimization problem on
labeled graphs, the maximization of the sum of similarities (3), remains an
open and very intriguing question.

Applications using the obtained results are the subject of forthcoming
publications.
Abstract. In this paper, we consider the general problem of technical document
interpretation, applied to the documents of the French telephone operator,
France Telecom. More precisely, we focus on the computation of a new set of
features allowing the classification of multi-oriented and multi-scaled
patterns. This set of invariants is based on the Fourier-Mellin transform. The
interest of this computation lies in the possibility of using the
Fourier-Mellin transform in a “filtering mode”, which makes it possible to
address the well-known difficult problem of connected character recognition. In
this paper, we also present an original technique for estimating the
orientation of each shape to be recognized.



1 Introduction 

The current improvements of intranet structures allow large companies to develop
internal communications between services. The heritage of large companies such
as network management firms is often held in paper documents, which can be
either graphic or textual. As a consequence, sharing this kind of information
will remain very difficult as long as the storage format is not digital. This
explains the current development of studies concerning the automatic analysis of
cartographic or engineering documents, which results from the growing needs of
industries and local groups in the development and use of maps and charts. The
aim of the interpretation of technical maps is to make the production of
documents easier by proposing a set of stages to transform the paper map into
an interpreted digital representation [1][2][3][4]. An important step of this
conversion process is the recognition of characters and symbols, which often
appear on technical documents in different orientations and sizes. In this
paper, we focus our attention on an original technique which allows the
recognition of multi-oriented and multi-scaled characters and strings. The
application considered in this paper is the automatic analysis of the documents
of a French telephone operator, France Telecom.

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 472-481, 2000.

© Springer-Verlag Berlin Heidelberg 2000

The paper is organized as follows. In the second section, we give a synthesis
of the bibliographic sources dealing with the multi-oriented and multi-scaled
pattern recognition problem. In the third section, we then propose our new
pattern description tool, based on a set of invariants derived from the
Fourier-Mellin transform. This part also describes the interesting properties
of this tool and its different modes of use, which make it possible to address
the difficult problem of connected pattern recognition. The orientation
estimation, which is an important factor in reconstructing strings, is also
presented in this section. In the fourth section, we present how this
multi-oriented and multi-scaled OCR (Optical Character Recognition), based on
two modes of use, is implemented in the particular case of connected pattern
recognition. Preliminary results are given in this part. Finally, in the fifth
section, we take a critical look at our approach and define some potential
perspectives for this work.



2 Characters and Symbols Recognition : Classical Approaches 

The recognition of characters and symbols is a difficult point of the current
CAD conversion process. By definition, we will consider as “symbols” the shapes
whose size is similar to that of the characters. This definition takes into
account the “alphanumeric characters” as well as the graphic symbols identified
by A. Chhabra in [5]. Works integrating particular constraints like orientation
or scale changes are less numerous in the literature. We note the work of
Deseiligny [6] in the particular case of maps, that of Dori [7], based on the
NETS system, and that of Trier [8]. In this field of research, three main
approaches may be distinguished.

• The first suggests a preliminary computation of the shape’s orientation, and tries, 
through a normalisation and a rotation step, to obtain a pattern in a reference po- 
sition which can be introduced into a classical OCR system. However, methods 
based on such a strategy are not frequently used because of the lack of methods 
that enable a reliable computation of the orientation to be obtained. Moreover, 
distortions due to sampling errors appear during the geometric transformations 
that are needed to normalise the pattern. 

• A second approach consists in using a multi-layer feed forward classifier, fed by 
the original image of the pattern. In this kind of context, the classifier renders the 
problem invariant with respect to the desired transformations [8]. 

• The last approach, which is probably the most frequently used, consists in ex- 
tracting from the shape a set of descriptors, which are invariant to the desired 
transformations. An excellent description of the state of the art in this domain can 
be found in Trier [8]. Generally, it is possible to note that the features used to de- 
scribe patterns independently from their position, size, and rotation, can be split 
up into two groups, as shown in the following. 
2.1 Descriptors Based on the Global Aspect of the Pattern

Many features can be used to describe the global aspect of a shape. Since the
works of Hu in 1961, invariant moments [9], which are based on combinations of
regular moments, have very often been used. One can thus cite Zernike or
pseudo-Zernike moments [10], Bamieh moments [11], and Legendre moments. These
invariant moments, which can be extracted from a binary or a grey-scale image,
generally offer properties of reconstructability, thus ensuring that the
extracted features contain all the information about the shape under study.
Good comparative studies of moment invariants can be found in [12], showing the
superiority of Zernike moments in terms of recognition accuracy. Nevertheless,
these studies also showed that moment-based approaches are sensitive to noise
and that they are time-consuming, even if complexity optimisation methods can
be found in the literature [13].

2.2 Descriptors Based on a Local Approach 

Besides the approaches presented above, a geometric invariant description can also be 
accomplished using features which are supposed to contain most of the pattern infor- 
mation. For example, contours are commonly used in order to obtain invariant de- 
scriptions of patterns through Fourier descriptors [14] or elliptic Fourier descriptors 
[15]. In terms of simplicity and robustness, the potential interest of these descriptors 
was shown by Taxt [16] through a comparative study between these descriptors. 
Structural invariant features can also be extracted from thinned characters. One can 
thus cite the number of loops, the number of T-joints or X-joints, the number of bend 
points. However, it has been shown that such features used alone do not lead to robust 
recognition systems [17]. Circular primitives, which are, by definition, well adapted to 
rotation invariant recognition, have been used in [18]. These are based on the analysis 
of the shape through a set of circles. A comparative study, available in [19], shows 
that they yield better results than Hu’s moments. In the field of invariant pattern rec- 
ognition, there is a consensus [12] [8] about the fact that the feature extraction method 
is probably the single most important factor. A large number of feature types is re- 
ported in the literature. As we have seen above, some of them are based on the global 
aspect of the pattern (e.g. moments) whereas others try to select particular points from 
the shape (e.g. circular primitives). Given this large number of existing methods, one 
could argue that it is not necessary to develop new invariant features. Nevertheless, 
comparative evaluation studies [12] have shown that the features presented above are 
not perfect in terms of recognition accuracy, especially when images are noisy.
Also, one must note the lack of literature on the analysis of connected
characters. Indeed, in the document processing field, with the exception of a
few papers presenting recognition techniques based on topographic or
topological criteria [6], there is, as far as we know, no work on the direct
recognition of connected shapes without prior segmentation or the introduction
of some a priori knowledge.
3 The Fourier-Mellin Transform and Its Properties

As we said previously, a strong constraint in the global interpretation of the
document comes from the fact that characters and symbols can have any
orientation and size. The consequence is that the recognition procedure to be
applied must be invariant with regard to any combination of rotation and
scaling of a pattern, i.e. any geometric similitude transformation. Another
strong constraint concerns the robustness of the recognition procedure. In
fact, after the binarization step, many characters are still connected either
together or to the network, rendering any classical pattern recognition
technique useless. The strategy that we propose covers both constraints within
a uniform framework. It is based on the application of generalized Fourier
analysis to the particular geometric group of positive similitudes. More
precisely, we make use of the Fourier-Mellin transform, whose properties are
very interesting for our application. Basically, the technique developed herein
is a combined use of the works of Ghorbel [20] and of Ravichandran and Trivedi
[21]. First, we will recall the definition of the Fourier-Mellin transform
(FMT). Then, we will recall the analytic prolongation of the FMT (AFMT) and a
set of complete and stable similitude invariant features, first proposed in
[20]. The properties of this set of invariants will then be presented.

3.1. The Fourier-Mellin Transform (FMT)

Let $f(\rho, \theta)$ be a real-valued function (the pattern) expressed in
polar coordinates. The FMT of this function is defined as the Fourier transform
on the group of positive similitudes:

$$M_f(v, q) = \int_{0}^{\infty}\!\int_{0}^{2\pi} \rho^{-iv} \exp(-iq\theta)\, f(\rho, \theta)\, \frac{d\rho}{\rho}\, d\theta \tag{1}$$

with $q \in \mathbb{Z}$, $v \in \mathbb{R}$.

In this expression, $i$ is the imaginary unit. It is well known that the
Fourier-Mellin integral does not converge in the general case, but only under
strong conditions on $f(\rho, \theta)$.


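3.2. Analytic Prolongation of the Fourier-Mellin Transform (AFMT)

In order to alleviate the above difficulty, Ghorbel [17] has proposed the use of
the AFMT, defined as:

$$M_f(v, q) = \int_{0}^{\infty}\!\int_{0}^{2\pi} \rho^{\sigma_0 - iv} \exp(-iq\theta)\, f(\rho, \theta)\, \frac{d\rho}{\rho}\, d\theta \tag{2}$$

with $q \in \mathbb{Z}$, $v \in \mathbb{R}$, and $\sigma_0 \in \mathbb{R}^*_+$.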
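For readers who want to experiment, Eq. (2) can be approximated by a Riemann sum on a polar sampling grid. The sketch below is only an illustration under assumptions: the grid sizes, the synthetic pattern, the value of sigma0 and the function name `afmt` are not taken from the paper. The final lines check that a rotation by a whole number of angular samples only changes the phase of a coefficient.

```python
import numpy as np

def afmt(f_polar, rho, theta, v, q, sigma0=0.5):
    """Riemann-sum approximation of the AFMT coefficient M_f(v, q) of Eq. (2).

    f_polar[a, b] holds f(rho[a], theta[b]); both grids are assumed uniform.
    """
    d_rho = rho[1] - rho[0]
    d_theta = theta[1] - theta[0]
    radial = rho ** (sigma0 - 1j * v) / rho          # rho^(sigma0 - iv) * (1 / rho)
    angular = np.exp(-1j * q * theta)
    return np.sum(radial[:, None] * angular[None, :] * f_polar) * d_rho * d_theta

# Synthetic pattern on a polar grid (hypothetical test data).
rho = np.linspace(0.05, 1.0, 64)
theta = np.linspace(0.0, 2.0 * np.pi, 128, endpoint=False)
f = (np.exp(-((rho[:, None] - 0.5) ** 2) / 0.02)
     * (1.5 + np.cos(2 * theta) + 0.5 * np.sin(3 * theta))[None, :])

shift = 16
beta = shift * (theta[1] - theta[0])
g = np.roll(f, -shift, axis=1)        # g(rho, theta) = f(rho, theta + beta)

m_f = afmt(f, rho, theta, v=1.3, q=2)
m_g = afmt(g, rho, theta, v=1.3, q=2)
print(abs(m_f), abs(m_g))
```

Because the angular grid is periodic, the rotated pattern is obtained exactly by a circular shift, so the two coefficients differ only by the factor exp(i q beta).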
An important property of the AFMT (as well as the FMT) relies on the
application of the shift theorem for the Fourier transform. Let
$g(\rho, \theta) = f(\alpha\rho, \theta + \beta)$ be a scaled and rotated
version of $f(\rho, \theta)$; then we have the following:

$$M_g(v, q) = \alpha^{-\sigma_0 + iv} \exp(iq\beta)\, M_f(v, q) . \tag{3}$$


Taking the modulus of both terms in Eq. (3) yields features which are invariant
under any rotation of the pattern but not under scaling. To obtain scale
invariance on this basis, one could use the following set of features:

$$I_f(v, q) = \frac{|M_f(v, q)|}{|M_f(0, 0)|} . \tag{4}$$



This set of invariant features provides a simple representation of shapes.
However, it does not respect the completeness property, i.e. there is no
bijection between the dual representations of a single pattern, since the phase
information is dropped.

In [20], the following set of rotation and scale invariant features was
proposed:

$$I_f(v, q) = M_f(v, q)\, [M_f(0, 0)]^{\frac{-\sigma_0 + iv}{\sigma_0}}\, [M_f(0, 1)]^{-q}\, |M_f(0, 1)|^{q} . \tag{5}$$
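The invariance of Eq. (5) can be checked numerically without computing any integral: take an arbitrary set of coefficients, transform it with the shift theorem of Eq. (3), and compare the two sets of invariants. In the sketch below the coefficient values are random test data and the function names are hypothetical. M_f(0, 0) is set to a positive real number, which is what Eq. (2) gives for a non-negative pattern and keeps the complex power on its principal branch.

```python
import numpy as np

def invariants(M, sigma0):
    """Eq. (5) applied to a dict {(v, q): coefficient} with keys (0, 0) and (0, 1)."""
    m00, m01 = M[0.0, 0], M[0.0, 1]
    return {(v, q): c * m00 ** ((-sigma0 + 1j * v) / sigma0)
                      * m01 ** (-q) * abs(m01) ** q
            for (v, q), c in M.items()}

def transform(M, alpha, beta, sigma0):
    """Eq. (3): coefficients of the pattern scaled by alpha and rotated by beta."""
    return {(v, q): alpha ** (-sigma0 + 1j * v) * np.exp(1j * q * beta) * c
            for (v, q), c in M.items()}

rng = np.random.default_rng(0)
sigma0 = 0.5
M_f = {(v, q): complex(rng.normal(), rng.normal())
       for v in (0.0, 1.0, 2.5) for q in (-1, 0, 1, 2)}
M_f[0.0, 0] = 3.0     # real and positive, as for a non-negative pattern
M_g = transform(M_f, alpha=2.0, beta=0.7, sigma0=sigma0)

I_f, I_g = invariants(M_f, sigma0), invariants(M_g, sigma0)
max_gap = max(abs(I_f[key] - I_g[key]) for key in M_f)
print(max_gap)
```

The first factor of Eq. (5) cancels the scale term of Eq. (3) and the last two factors cancel the rotation phase, so the remaining gap is only floating-point noise.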



3.3. Properties 



Taking Eq. 5, if $g(\rho, \theta) = f(\alpha\rho, \theta + \beta)$, it can
easily be shown that $I_g(v, q) = I_f(v, q)$, thus showing the invariance of
the set of descriptors under a change of scale or rotation. Other properties of
these features are:

(i) their completeness: given the quantities $M_f(0, 0)$, $M_f(0, 1)$ and
$\{I_f(v, q)\}$, it is possible to recover the whole set of FM coefficients
$\{M_f(v, q)\}$, and thus to reconstruct $f(\rho, \theta)$, up to a constant
normalization factor, by using the inverse AFMT in the following way:

$$f(\rho, \theta) = \int_{\mathbb{R}} \sum_{q \in \mathbb{Z}} M_f(v, q)\, \rho^{-\sigma_0 + iv} \exp(iq\theta)\, dv \tag{6}$$



(ii) their convergence: it is proven in [17] that, under a convergence
assumption, there exists $x \in \mathbb{R}$, $x > 1$, such that

$$\Big( \cdots \Big)^{1/x} < +\infty \tag{7}$$



One important consequence of the completeness and convergence of this set of
invariant features lies in the existence of a metric in the shape
representation space, which enables a classification process. Another
consequence is the possibility of determining the orientation and scale of
shapes from the set of extracted descriptors, through a comparison with
descriptors extracted from reference shapes. Indeed, from Eq. 3 and through the
knowledge of a reference shape, it is possible to obtain the values of $\alpha$
and $\beta$ ($\alpha$ being the scale factor and $\beta$ the angle between the
unknown shape and the corresponding reference shape). In order to compute these
similitude parameters $\alpha$ and $\beta$, it is for example possible to
minimize the quadratic error criterion

$$E(\alpha, \beta) = \sum_{q} \int_{\mathbb{R}} \left| M_g(v, q) - \alpha^{-\sigma_0 + iv} \exp(iq\beta)\, M_f(v, q) \right|^2 dv , \tag{8}$$

where $f$ denotes the reference shape and $g$ the unknown shape.



In Fig. 1, one can see that the global minimum of the function is obtained for
the correct values of $(\alpha, \beta)$. Indeed, in this example, the criterion,
calculated with 33 invariant features, is minimized for $\alpha \approx 1$ and
$\beta \approx 100$ degrees, which is quite accurate.
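The parameter estimation described above can be sketched as a grid search over a discretized version of the quadratic criterion. Everything in the block is an assumption made for illustration: the reference coefficients are random numbers rather than descriptors of a real shape, the integral over v is replaced by a sum over three sample values, and the grid resolution is arbitrary. The observed coefficients are generated with Eq. (3) for a scale of 1 and a rotation of 100 degrees.

```python
import numpy as np

def similarity_error(M_ref, M_obs, alpha, beta, sigma0):
    """Discretized quadratic error between observed and transformed reference coefficients."""
    return sum(abs(M_obs[v, q]
                   - alpha ** (-sigma0 + 1j * v) * np.exp(1j * q * beta) * c) ** 2
               for (v, q), c in M_ref.items())

rng = np.random.default_rng(1)
sigma0 = 0.5
M_ref = {(v, q): complex(rng.normal(), rng.normal())
         for v in (0.0, 0.5, 1.0) for q in (0, 1, 2)}
true_alpha, true_beta = 1.0, np.deg2rad(100.0)
M_obs = {(v, q): true_alpha ** (-sigma0 + 1j * v) * np.exp(1j * q * true_beta) * c
         for (v, q), c in M_ref.items()}

alphas = np.linspace(0.5, 2.0, 31)               # candidate scale factors
betas = np.deg2rad(np.arange(0, 360, 5))         # candidate angles, 5 degree steps
errors = np.array([[similarity_error(M_ref, M_obs, a, b, sigma0) for b in betas]
                   for a in alphas])
ia, ib = np.unravel_index(np.argmin(errors), errors.shape)
best_alpha, best_beta_deg = float(alphas[ia]), float(np.rad2deg(betas[ib]))
print(best_alpha, best_beta_deg)
```

A coarse grid like this one could serve as the starting point of a local optimizer when finer estimates are needed; including the q = 1 coefficients is what makes the angle identifiable over the full circle.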




Fig. 1. Quadratic error calculated between a reference shape and a 100 degrees rotated version. 
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Fig. 2. Left :Set of shapes used for the orientation estimation. Characters of the first 
column are used as reference shapes. Right: Estimation of the rotation angle in radians 
of the shapes from column 2 to 13 in comparison with the reference shapes 



The right part of Fig. 2 presents the results obtained with this technique for
estimating the orientation of the shapes shown in the left part, using the first
column as reference shapes. One can see 12 “groups” of stars, each corresponding
to the common orientation of the 5 shapes in one column. These preliminary
results are very interesting, since the orientation and the scale of a shape are
fundamental information for reconstructing strings.
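
As an illustration only, the minimization of the quadratic error over the scale factor and the rotation angle (called alpha and beta in the code) can be sketched as an exhaustive grid search. The sketch assumes that the descriptors obey the usual Fourier-Mellin shift relation, M_obs(k, v) = alpha^(-sigma + iv) exp(ik beta) M_ref(k, v). This relation, the value of sigma, the dictionary layout of the descriptors and all function names are assumptions of the sketch and are not taken from the paper.

```python
import cmath
import math

SIGMA = 0.5  # assumed AFMT convergence parameter (hypothetical value)


def predict(ref, alpha, beta, sigma=SIGMA):
    """Descriptors expected for the reference shape scaled by alpha and
    rotated by beta, under the assumed shift relation."""
    return {
        (k, v): alpha ** (-sigma)
        * cmath.exp(1j * (v * math.log(alpha) + k * beta))
        * value
        for (k, v), value in ref.items()
    }


def quadratic_error(ref, obs, alpha, beta):
    """Sum of squared moduli of the descriptor differences."""
    pred = predict(ref, alpha, beta)
    return sum(abs(obs[key] - pred[key]) ** 2 for key in ref)


def estimate(ref, obs, alphas, betas):
    """Grid point (alpha, beta) with the smallest quadratic error."""
    candidates = [(a, b) for a in alphas for b in betas]
    return min(candidates, key=lambda ab: quadratic_error(ref, obs, ab[0], ab[1]))


# Synthetic check: made-up reference descriptors, observed after a 100 degree rotation.
ref = {(k, v): complex(1 + k, 0.5 * v + 0.25) for k in range(3) for v in (-1.0, 0.0, 1.0)}
obs = predict(ref, 1.0, math.radians(100))
alphas = [0.5, 0.75, 1.0, 1.25, 1.5]
betas = [math.radians(deg) for deg in range(0, 360, 10)]
alpha_hat, beta_hat = estimate(ref, obs, alphas, betas)
```

A coarse grid is used here for readability; a finer grid or a continuous optimizer could be substituted without changing the criterion.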

3.4. Discussion 

In this part, we have recalled the definition of the FMT and its analytic prolongation
(AFMT). We have seen that the AFMT should be of interest for multi-orientation
and multi-scale character recognition, since it is a complete and convergent
transformation. In terms of system implementation, this tool can be used in two
sequential steps. The first one concerns isolated patterns, which can easily be extracted
from a document by a component extractor. The methodology and the results
obtained for this step are described in detail in [22], which also gives details of the
application to 2-D square lattice images and results on theoretical and practical
data. These results are excellent: classification rates reach 95 % on real data taken
from technical documentation, and a comparison with classical approaches shows the
superiority of the AFMT methodology [22]. The second step aims to detect and
recognize characters or symbols which are connected to each other or to graphical
parts of a document. This point is presented in the next section.



4. Application of the AFMT for the Recognition of Connected 
Shapes on Technical Documents 



In this section, we detail the implementation of the AFMT in the particular case
of the recognition of connected patterns. This operation relies on the possibility
of using the AFMT in a filtering mode, as shown in [21] for texture classification.
More precisely, the image is first convolved with a set of Fourier-Mellin filters,
and one then tries to locate the pixels for which a pre-specified response is obtained
and which may be identified as the centroid of a pattern. The scheme for the analysis
of images with connected patterns is shown in Fig. 3.




Fig. 3. Scheme for the analysis of images containing connected patterns 

As one can see in this figure, the first step consists in computing the 2-D Fourier
transform of both the image and the set of filters. This reduces the computational
burden by turning convolution into multiplication in the Fourier domain. Then, the
Fourier transform of the image is multiplied by the Fourier transform of each filter.
An inverse Fourier transform applied to each product yields, for each pixel of the
initial image, a set of Fourier-Mellin descriptors (see Fig. 4: “Complex Image layers
of descriptors”). A classification process is then applied to each of these vectors in
order to assign a class to each pixel by comparison with the responses obtained for
isolated characters. Of course, a reject option is introduced in this classification
process in order to avoid too many detection responses. This reject is defined through
a confidence value given by the classifier (strategy no. 1 in Fig. 4). Finally, using the
responses and confidences given by the classification phase, the system takes a “final
decision” for the image. Different solutions can be used to take this final decision.
Since a response is available for each pixel, it is also worthwhile to take spatial
information into account, because pixels in the neighborhood of the theoretical
centroid of a pattern will also respond with the class of this pattern (strategy no. 2
in Fig. 4). The next figure gives the results obtained on synthetic images with the 2
decision methods. It is important to note that, with the classical isolated-pattern
approach, 15 images were not recognized. The results shown in this figure are quite
logical. Indeed, one can see that the “3” is recognized in most images, whereas the “8”
is, logically, recognized in some configurations. This is largely due to the fast decay
of the magnitude of the Fourier-Mellin filter h(.,.) as the distance from the center
point of the filter increases. Significant tests on France Telecom technical documents
are currently in progress in order to provide a reliable evaluation of these methods;
final results will be given in the final paper.
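
The replacement of convolution by multiplication in the Fourier domain can be illustrated on a small example. The sketch below is a one-dimensional simplification with a naive discrete Fourier transform and a real-valued kernel, whereas the paper works with 2-D images and complex Fourier-Mellin filters. All function names are hypothetical.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), or its inverse."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_convolve_direct(signal, kernel):
    """Circular convolution computed from its definition."""
    n = len(signal)
    return [sum(signal[m] * kernel[(t - m) % n] for m in range(n)) for t in range(n)]


def circular_convolve_fourier(signal, kernel):
    """Same result via a pointwise product of the two transforms."""
    product = [s * k for s, k in zip(dft(signal), dft(kernel))]
    return dft(product, inverse=True)


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0, 0.0]  # kernel zero-padded to the signal length
direct = circular_convolve_direct(signal, kernel)
via_fourier = circular_convolve_fourier(signal, kernel)
```

With a fast Fourier transform in place of the naive one, the transform of the image is computed once and reused for every filter, which is where the saving comes from.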




Fig. 4. Classification results obtained on synthetic images: left part, strategy 1; right part, strategy 2.



5. Conclusion and Perspectives 

In this paper, we have proposed an original methodology allowing the detection and
recognition of multi-oriented and multi-scaled shapes. Since the recognition of isolated
patterns using this methodology has already been presented in a recent paper [22], we
focus in this submission on two innovative points. The first one deals with the
estimation of character orientation. On this point, the results obtained are very
encouraging, and the integration of this tool is currently in progress in order to
reconstruct strings of characters. Indeed, in technical documents, “consistent”
strings must have the same orientation and the same size. The second point deals with
the detection and the recognition of connected patterns. In this submission, the applied
process is described in detail and very encouraging preliminary results are shown.
The perspectives of this work are numerous. First, concerning the orientation
estimation approach, a study is currently under way to find the best optimization
method among classical tools such as genetic algorithms or simulated annealing.
Concerning the recognition of connected patterns, tests are currently in progress in
order to choose the best decision method. All these considerations and the current
results make us very optimistic about the future of this project, since the results and
the possible improvements of the methodology suggest that classification rates should
still increase by several points.

Abstract. This paper describes an algorithm to design a tree-structured
classifier with the hyperplanes associated with a set of prototypes. The
main purpose of this technique is to define a classification scheme
whose result is close to that produced by the Nearest Neighbour decision
rule, while achieving substantial computational savings during classification.

Keywords: Nearest Neighbour; Decision Tree; Classification; Gabriel 
Graph; Relative Neighbourhood Graph 



1 Introduction 

Non-parametric classification by means of a distance measure is one of the
earliest methods used in Pattern Recognition. The Nearest Neighbour (NN) rule
is an appropriate example of this kind of classifier. Given a set of n previously
labelled prototypes (namely, the training set) in a d-dimensional feature space, this
rule assigns to a given sample the same class as the closest prototype in the
training set.
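
For concreteness, the NN rule can be written in a few lines. This is a brute-force sketch with the Euclidean distance; the function name and the representation of prototypes as (feature vector, label) pairs are illustrative choices.

```python
import math


def nearest_neighbour(sample, prototypes):
    """Return the label of the prototype closest to the sample.

    prototypes is a list of (feature_vector, label) pairs; every call
    computes the distance to all n prototypes.
    """
    _, label = min(prototypes, key=lambda proto: math.dist(sample, proto[0]))
    return label


training_set = [((0.0, 0.0), "A"), ((1.0, 0.0), "A"), ((5.0, 5.0), "B")]
predicted = nearest_neighbour((4.0, 4.5), training_set)
```

The n distance computations per query are the cost that the techniques discussed next try to avoid.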

Despite the simplicity and effectiveness of NN classifiers, it is well
known that these schemes suffer from some practical drawbacks, such as needing
a lot of memory and computational resources for large training sets. This is
why numerous investigations have been carried out on this technique in order to
find the nearest neighbour of an unknown test sample with as few computations
as possible. For example, several strategies have been proposed to devise fast
algorithms to search for the nearest neighbour. On the other hand,
condensing methods aim to reduce the training set
size by selecting some prototypes among the available ones. Nevertheless, it is
worth mentioning that these techniques cannot avoid a certain degradation of
the classification performance.

A third possibility consists of generating new samples instead of selecting a
subset of prototypes. This approach corresponds to a kind of supervised
vector quantization, usually referred to as LVQ, where a discrimination purpose
is taken into account.

Finally, other alternatives focus on the use of data structures which
allow a more efficient classification than computing distances from a given test
sample to all prototypes in the training set. Most of these approaches are based
on a certain partition of the d-dimensional feature space. In particular, kd tree
methods are a popular tool to obtain a tree-like classifier from a decomposition
of the feature space with a set of hyperplanes. A similar solution
implements the NN rule in the form of a neural network
derived from the Voronoi Diagram (VD) associated with a set of points.

Related to the latter group of proposals, an algorithm has recently
been introduced to design a decision tree equivalent to the NN rule
by partitioning the feature space by means of a subset of the hyperplanes that
define the VD associated with the training set. Nevertheless, this technique
has an important flaw: the use of the VD. While the VD can be efficiently constructed
in 2 dimensions, there are no efficient algorithms for higher dimensional spaces.
Obviously, this severely limits the applicability of such an approach.

This paper presents an alternative scheme for designing a decision tree whose
classification result approximates that of the NN rule. The aim is to overcome
the drawbacks related to the construction of the VD, which make its application
to real problems infeasible. Thus, it is proposed here to use a geometrical
structure defined from a set of proximity graphs, whose construction is
computationally cheaper than that of the VD.

The organization of this paper is as follows. Section 2 introduces the main 
concepts related to decision trees. Section 3 provides a brief overview of the 
proximity graphs used in this work. The algorithm to derive a decision tree- 
structured NN classifier from VD is outlined in Section 4. The scheme proposed 
in this paper is given in Section 5. Finally, the main contributions of this paper 
along with concluding remarks are summarized in Section 6. 



2 Decision Trees 

A decision tree (DT) is a particularly useful tool for complex classification
tasks because these are performed by a sequence of simple, easy-to-understand
tests whose semantics is intuitively clear to domain experts. Using the terminology
of books on data structures, the top node of a DT is called the root. In a
binary DT, each node has either no child (in such a case, it is called terminal 
node or leaf, which is associated with a class), or a left child and a right child. 
Each node is the root of a tree itself. The trees rooted at the children of a node 
are called the left and right subtrees of that node. The depth of a node is defined 
as the length of the path from the node to the root. The height of a tree is the 
maximum depth of any node. 

A DT classifier uses some decision functions at non-terminal nodes to
determine the class membership of an unknown sample. The evaluation of these
decision rules is organized in such a way that the outcome of successive decision
functions reduces uncertainty about the sample being considered for classification.
Many variants of DT algorithms have been introduced in the literature.
Much of this work has concentrated on DTs in which each non-terminal node
checks the value of a single feature or attribute. When the attributes are
numeric, the tests have the form $x_i > t$, where $x_i$ is one of the attributes of a
sample and $t$ is a constant. DTs of this kind are generally called axis-parallel,
because the decision functions at each node are equivalent to axis-parallel
hyperplanes in the feature space.

Another option is DTs that test a linear combination of the
attributes at each non-terminal node. More precisely, let a prototype take the
form $X = (x_1, x_2, \ldots, x_d, c_j)$, where $c_j$ is a class label and the $x_i$'s are real-valued
attributes. Thus, the decision function at each node will have the form

$$\sum_{i=1}^{d} a_i x_i + a_{d+1} > 0$$

where $a_1, \ldots, a_{d+1}$ are real-valued coefficients. DTs of this kind are usually called
oblique decision trees because those linear tests are equivalent to hyperplanes at
an oblique orientation to the axes. In the case of oblique DTs, the decomposition
of the feature space takes the form of polyhedral partitionings. It seems clear
enough that, in many domains, axis-parallel methods will have to approximate
the correct model with a staircase-like structure, while an oblique tree-building
technique could capture it with a more accurate DT.

Classification by means of a binary DT has some attractive advantages.
First, it establishes a decision technique that is easy to analyze and understand.
Second, it uses a systematic way of calculating the class to assign to a given
sample. Third, a DT is much faster in terms of computing time than
a traditional NN approach, since it does not need to compute distances from a
test sample to all prototypes in the training set, which obviously constitutes an
important benefit from a practical point of view (in fact, it needs at most m
comparisons in a tree of height m).
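
A minimal sketch of classification with an oblique binary DT is given below. It assumes that each non-terminal node stores the coefficients of a linear test of the form a_1 x_1 + ... + a_d x_d + a_{d+1} > 0. The `Node` class, the convention that a true test leads to the right child, and the example tree are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    label: Optional[str] = None               # set on leaves only
    coeffs: Optional[Sequence[float]] = None  # (a_1, ..., a_d, a_{d+1}) on internal nodes
    left: Optional["Node"] = None             # followed when the test is false
    right: Optional["Node"] = None            # followed when the test is true


def classify(root, x):
    """Walk from the root to a leaf; return (label, number of tests evaluated)."""
    node, comparisons = root, 0
    while node.label is None:
        *weights, bias = node.coeffs
        value = sum(w * xi for w, xi in zip(weights, x)) + bias
        node = node.right if value > 0 else node.left
        comparisons += 1
    return node.label, comparisons


# Tree of height 2: the root tests x1 + x2 - 1 > 0, its left child tests x1 - x2 > 0.
tree = Node(
    coeffs=(1.0, 1.0, -1.0),
    right=Node(label="B"),
    left=Node(coeffs=(1.0, -1.0, 0.0), right=Node(label="A"), left=Node(label="C")),
)
```

The number of tests never exceeds the height of the tree, independently of the number of prototypes used to build it.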



3 Proximity Graphs 

Let $X = \{x_1, \ldots, x_n\}$ be a set of n points in $\mathbb{R}^d$, where d denotes the
dimensionality of the feature space. Then a proximity graph, say $G = (V, E)$, is defined
as an undirected graph with the set of vertices $V = X$ and the set of edges
$E$ such that $(x_i, x_j) \in E$ if and only if $x_i$ and $x_j$ satisfy some neighbourhood
relation. In such a case, we say that $x_i$ and $x_j$ are graph neighbours. The set
of graph neighbours of a given point constitutes its graph neighbourhood. The
graph neighbourhood of a subset $S \subset V$ consists of the union of all the graph
neighbours of every node in $S$. The Gabriel Graph (GG), the Relative
Neighbourhood Graph (RNG) [14] and the Delaunay Triangulation (DTR) are the
most prominent examples of proximity graphs.



Let $d(\cdot,\cdot)$ be the Euclidean distance in $\mathbb{R}^d$. The edges in a GG are defined as
follows:

$$(x_i, x_j) \in E \iff d^2(x_i, x_j) \le d^2(x_i, x_k) + d^2(x_j, x_k) \quad \forall x_k \in X,\ k \ne i, j$$

In this case, $x_i$ and $x_j$ are said to be Gabriel neighbours. In other words,
two points are Gabriel neighbours if and only if there is no other point from
$X$ lying inside the hypersphere (namely, the hypersphere of influence) centered at their
midpoint and whose diameter is the distance between them.
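
The Gabriel neighbour condition translates directly into a brute-force construction. The sketch below tests every pair of points against every other point, so it costs O(n^3) distance evaluations; it is meant to make the definition concrete and is not presented as the construction used in the paper. Function names are illustrative.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def gabriel_graph(points):
    """Set of index pairs (i, j), i < j, that are Gabriel neighbours."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # Edge kept if no third point lies strictly inside the hypersphere
        # whose diameter is the segment joining the two points.
        if all(
            d_ij <= sq_dist(points[i], points[k]) + sq_dist(points[j], points[k])
            for k in range(len(points))
            if k not in (i, j)
        ):
            edges.add((i, j))
    return edges


points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.5)]
gg_edges = gabriel_graph(points)
```

In this example the third point lies inside the hypersphere of influence of the first two, so the edge between them is rejected.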

Analogously, the set of edges in the RNG is obtained as follows: 



$$(x_i, x_j) \in E \iff d(x_i, x_j) \le \max[d(x_i, x_k), d(x_j, x_k)] \quad \forall x_k \in X,\ k \ne i, j$$

Its corresponding geometric interpretation is based on the concept of lune,
which is defined as the intersection of the two hyperspheres centered
at $x_i$ and $x_j$ whose radii are equal to the distance between them. Thus, two
points are relative neighbours if and only if their lune does not contain other
points from $X$.
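
The relative neighbour condition can be checked in the same brute-force manner. The sketch compares squared distances, which preserves the ordering of the distances themselves. Function names are illustrative.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def relative_neighbourhood_graph(points):
    """Set of index pairs (i, j), i < j, that are relative neighbours."""
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # Edge kept if no third point lies strictly inside the lune of the pair.
        if all(
            d_ij <= max(sq_dist(points[i], points[k]), sq_dist(points[j], points[k]))
            for k in range(len(points))
            if k not in (i, j)
        ):
            edges.add((i, j))
    return edges


points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
rng_edges = relative_neighbourhood_graph(points)
```

Here the third point lies on the boundary of the hypersphere of influence of the first two, so they remain Gabriel neighbours, but it falls inside their lune, so they are not relative neighbours.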

Finally, the DTR of a set of points $X = \{x_1, \ldots, x_n\}$ is defined as the dual
graph of the VD of the set $X$, which is a decomposition of $\mathbb{R}^d$ into n cells.
Consider all triangles formed by the points such that the circumcircle of each
triangle is empty of other points. The set of edges of these triangles gives the
DTR of the points. Therefore, two points in a DTR are connected with an edge
if the boundaries of their Voronoi cells intersect.

As an important property, the DTR is a supergraph of the GG, which in turn
is a supergraph of the RNG. They form a hierarchy of graphs as given
below:

$$\mathrm{RNG} \subseteq \mathrm{GG} \subseteq \mathrm{DTR}$$

In a similar way to the DTR and the VD associated with a set of points, it is
also possible to define a dual structure for the GG and the RNG. Thus, the dual
of a GG (or RNG) will be a sequence of convex regions covering the d-dimensional
feature space. These regions are obtained from the perpendicular bisectors of the
pairs of points connected by an edge in the corresponding graph structure.



4 Decision Tree Induction for NN Classification 

As previously mentioned, several methods try to improve the efficiency of the
NN rule by using some kind of fast data structure, which can be induced from
a partitioning of the feature space. Most of these approaches use either a DT or
a neural network configuration in order to represent the decomposition of the
feature space into regions, each one assigned to a problem class. In fact, the different
alternatives essentially differ in the structure used as well as in the partitioning
criterion.

This section focuses on the construction of a DT by means of a subset of
the hyperplanes which define the VD associated with a set of prototypes.
As a pre-processing step, the VD associated with the training set must be
generated. Then, the information necessary to design the DT is derived from the
hyperplanes corresponding to the Voronoi boundaries (that is, those hyperplanes
separating two Voronoi cells which belong to different classes) related to the
training set. Construction of the tree structure is done recursively.

In a few words, the design approach divides the feature space into a number 
of convex regions such that the populations corresponding to each one become 
class homogeneous (that is, all prototypes belonging to one region are from the 
same class). The decision functions for non-terminal nodes of the DT correspond 
to the hyperplanes of the Voronoi boundaries, while leaves have the class labels 
assigned to each one of those resulting convex regions. 

From a practical point of view, there is an important drawback.
When the number of hyperplanes that define the Voronoi boundaries associated
with a set of prototypes is large, the resulting DT can grow excessively.
Note, however, that this problem is much more pronounced when all the
hyperplanes of the entire VD derived from the training set are used, as in the case of the
neural network design algorithm. Furthermore, it would be possible to reduce
the DT size by previously applying some prototype selection procedure to
the original training set.

Nevertheless, the most important limitation of this method is due to the 
heavy computational loads required for constructing the VD. Consequently, it 
is to be admitted that the use of this technique is for all practical purposes of 
little value. 

5 A Decision Tree Approximation to NN Classification 

The approach proposed in this paper belongs to those methods which construct 
some kind of structure for efficient NN classification from partitioning the feature 
space. The design algorithm provided here requires a pre-processing step in which 
the dual of the GG (or RNG) associated with the set of prototypes is computed. 
The basis of our alternative is focused on the idea that only those hyperplanes 
separating two graph cells which belong to different classes (from now on, called 
hyperplanes from the graph boundaries, or simply splits) are necessary for an 
accurate decomposition of the feature space. In fact, this idea is in line with
condensing methods (in particular, with Toussaint’s condensing). Thus, the
number of hyperplanes used to construct the classifier could be considerably
reduced.

The algorithm proposed here follows a top-down methodology to construct a 
particular DT. It basically consists of a splitting stage to generate non-terminal 
nodes, a stopping criterion, and a procedure for assigning a class label to each
resulting leaf. Thus, the design approach divides the feature space into a number 
of convex regions such that the populations corresponding to each one become 
class homogeneous (that is, all prototypes belonging to one region are from the 
same class). These regions are derived from the hyperplanes corresponding to 
the already defined graph boundaries related to the training set. This process is 
straightforward and gives rise to a DT whose classification result is approxima- 
tely the same as that of the NN rule. 

Let $X$ denote a set of n prototypes belonging to L distinct classes in a
d-dimensional feature space, and let $H = \{H_1, H_2, \ldots, H_m\}$ be the set of the m
hyperplanes which define the graph boundaries associated with $X$. Next, the
design procedure will be discussed in detail.



5.1 Splitting 

The algorithm begins by considering the entire d-dimensional feature space as a
unique convex region, $R_1$. Taking into account that any hyperplane $H_i$ can
divide the space into two half-spaces (namely, the positive half-space $H_i^+$ and
the negative one $H_i^-$), the first hyperplane, $H_1 \in H$, will divide $R_1$ into two
partial convex regions. In this way, $H_1$ is designated as the decision rule for
the root node.

The procedure then goes on to take the next hyperplanes in the set $H$. At
each iteration, the split rule consists of testing whether or not the given hyperplane
divides some of the partial convex regions $R_j$, that is, the regions defined
by the decision rules (or hyperplanes) corresponding to the nodes in each path
of the current DT. When a hyperplane $H_i$ results in a new partition of
any of those regions $R_j$, the DT is expanded with a new non-terminal node
whose decision rule corresponds to the hyperplane $H_i$ just tested, i.e., the
convex region $R_j$ is divided into two new smaller convex regions. This process
is iterated over all the hyperplanes of the graph boundaries $H_i \in H$.

This problem is solved by using the Simplex algorithm
for linear programming. In order to test whether a hyperplane $H_i \in H$
partitions a convex region or not, the algorithm formulates a linear program
which consists of optimizing an arbitrary linear objective function subject to a
set of constraints. These constraints are the inequalities (i.e., constraints of
the form ≥ or ≤) corresponding to the half-spaces which define the given convex
region, and the equation (i.e., a constraint of the form =) of the hyperplane
$H_i$. The meaning of such a linear program is as follows: the hyperplane $H_i$ does
not divide a convex region if and only if the corresponding linear program is not
feasible.
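
Since the objective function is arbitrary, only the feasibility of the linear program matters. The sketch below therefore replaces the Simplex algorithm named in the text with Fourier-Motzkin elimination, which decides feasibility in a few lines of standard-library Python but is practical only for a small number of dimensions and constraints. The representation of a half-space as a pair (coefficients, bound) meaning coefficients . x <= bound, and the function names, are assumptions of the sketch.

```python
from fractions import Fraction


def feasible(rows, dim):
    """Decide whether {x : coeffs . x <= bound for every row} is non-empty.

    Variables are eliminated one at a time; exact rational arithmetic
    avoids tolerance issues.
    """
    rows = [([Fraction(c) for c in coeffs], Fraction(bound)) for coeffs, bound in rows]
    for k in range(dim):
        pos = [r for r in rows if r[0][k] > 0]
        neg = [r for r in rows if r[0][k] < 0]
        rows = [r for r in rows if r[0][k] == 0]
        for pc, pb in pos:
            for nc, nb in neg:
                # Scale both rows so that x_k has coefficients +1 and -1, then add.
                coeffs = [p / pc[k] - n / nc[k] for p, n in zip(pc, nc)]
                rows.append((coeffs, pb / pc[k] - nb / nc[k]))
    # Only constraints of the form 0 <= bound remain.
    return all(bound >= 0 for _, bound in rows)


def hyperplane_divides(region, normal, offset):
    """True if the hyperplane normal . x = offset meets the convex region.

    The equation is encoded as the pair of inequalities <= and >=.
    """
    equality = [(list(normal), offset), ([-c for c in normal], -offset)]
    return feasible(list(region) + equality, len(normal))


# Unit square 0 <= x1 <= 1, 0 <= x2 <= 1 written as four half-spaces.
square = [([-1, 0], 0), ([1, 0], 1), ([0, -1], 0), ([0, 1], 1)]
```

As in the formulation above, the inequalities are non-strict, so a hyperplane that merely touches the boundary of a region is also reported as meeting it.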

5.2 Assignment of Class Labels to Leaves 

To label the leaves obtained from the splitting
stage, a sample inside each convex region (that is, each path in the DT
from the root node to a leaf) is generated, and then its nearest neighbour among
the samples in the input training set is searched for. Finally, the class label of this
nearest neighbour is assigned to the corresponding convex region.

In order to generate a sample belonging to a certain convex region, the edges
of such a region are moved inwards by an arbitrary distance, say δ, and then any
vertex of the resulting region is taken. This vertex will be the optimal vector of
a new linear program with an arbitrary objective function subject to a set of
constraints (in this case, those corresponding to the inequalities of the half-spaces
which define the inner convex region).



5.3 Pruning 

The size of the resulting oblique DT can be reduced by means of a simple pruning
approach. If both children (leaves) of a non-terminal node have
been assigned the same class label, the hyperplane associated with that
decision node separates two convex regions which belong to the same class and
is therefore irrelevant for the discrimination process. Hence, such a
non-terminal node is assigned the class attached to its children, and
these two nodes are deleted from the DT. Obviously, other
pruning alternatives can be applied to this DT design algorithm.
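
The pruning rule can be sketched as a bottom-up recursion, so that a node whose children become equal-labelled leaves after pruning is itself collapsed. The `Node` class and its field names are hypothetical, and applying the rule bottom-up is one possible reading of the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: Optional[str] = None    # class label, set on leaves
    split: Optional[str] = None    # identifier of the hyperplane, set on internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def is_leaf(node):
    return node.left is None and node.right is None


def prune(node):
    """Collapse every internal node whose two children are leaves of the same class."""
    if is_leaf(node):
        return node
    node.left = prune(node.left)
    node.right = prune(node.right)
    if is_leaf(node.left) and is_leaf(node.right) and node.left.label == node.right.label:
        return Node(label=node.left.label)  # the split is irrelevant for discrimination
    return node


tree = Node(
    split="H1",
    left=Node(split="H2", left=Node(label="A"), right=Node(label="A")),
    right=Node(label="B"),
)
pruned = prune(tree)
```

In this example the split H2 is removed, while H1 is kept because its children end up with different labels.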

5.4 The Overall Algorithm 

After a pre-processing step to compute the dual of the GG (or RNG) of the sam- 
ples in the training set, as well as to extract those hyperplanes corresponding to 
the graph boundaries, the approach just described can be formally summarized 
in the following algorithm: 

1. Let $R$ denote the set of convex regions in the tree structure
   after each iteration. Begin with the entire d-dimensional feature
   space as the unique region in $R$.

2. Splitting. For each $H_i \in H$, do

   - $R' = \emptyset$

   - For each $R_j \in R$ do

     • If $R_j$ can be linearly separated by $H_i$:

       A. Designate $H_i$ as decision rule for the non-terminal node represented by $R_j$.

       B. Make $R_{j+1} = R_j \cap H_i^-$ and $R_{j+2} = R_j \cap H_i^+$ the convex regions that represent the children nodes of $R_j$.

       C. Delete $R_j$ from $R$ and put $R_{j+1}$, $R_{j+2}$ in $R'$.

     • Else put $R_j$ in $R'$.

   - $R = R'$

3. Assignment. For each final convex region $R_j \in R$ do

   - Assign to $R_j$ a class label.

4. Pruning. Prune the resulting oblique DT.
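
The steps above can be traced on a one-dimensional toy problem, used here purely for illustration. In one dimension, hyperplanes reduce to thresholds and convex regions to intervals, and consecutive prototypes are graph neighbours, so the graph boundaries are the midpoints between consecutive prototypes of different classes. The linear-programming test becomes an interval check, and the result is kept as a flat list of labelled regions instead of a tree. All names are hypothetical.

```python
import math


def interior_point(lo, hi):
    """Any point strictly inside the interval (lo, hi)."""
    if math.isinf(lo) and math.isinf(hi):
        return 0.0
    if math.isinf(lo):
        return hi - 1.0
    if math.isinf(hi):
        return lo + 1.0
    return (lo + hi) / 2.0


def build_regions(prototypes):
    """prototypes: list of (value, label) pairs with distinct values."""
    pts = sorted(prototypes)
    # Pre-processing: thresholds separating neighbours of different classes.
    splits = [(a[0] + b[0]) / 2.0 for a, b in zip(pts, pts[1:]) if a[1] != b[1]]
    regions = [(-math.inf, math.inf)]          # step 1: the whole space
    for h in splits:                           # step 2: splitting
        updated = []
        for lo, hi in regions:
            if lo < h < hi:                    # h divides this region
                updated += [(lo, h), (h, hi)]
            else:
                updated.append((lo, hi))
        regions = updated
    labelled = []
    for lo, hi in regions:                     # step 3: label by nearest neighbour
        sample = interior_point(lo, hi)
        _, label = min(pts, key=lambda p: abs(p[0] - sample))
        labelled.append(((lo, hi), label))
    return labelled


prototypes = [(0.0, "a"), (1.0, "a"), (4.0, "b"), (6.0, "a")]
regions = build_regions(prototypes)
```

Step 4 is omitted because, in this one-dimensional setting, adjacent intervals always receive different labels and nothing can be pruned.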

6 Conclusions and Extensions

In this paper, a technique for the design of a specific kind of binary DT, whose
classification result is approximately the same as that achieved by the traditional
NN rule, has been introduced. More specifically, this DT is equivalent to
the proximity-graph-based non-parametric classification rule. The description of
such a classifier and its empirical analysis are covered extensively in the literature.

The DT induction algorithm is based on the use of the graph boundaries
(that is, hyperplanes separating graph cells with different class labels) derived
from a set of prototypes, and consists of a recursive process which decomposes
the feature space into a number of convex regions, each one finally assigned to
a certain problem class. Thus, the decision functions for non-terminal nodes of 
the DT correspond to the hyperplanes of those graph boundaries (or splits), 
while leaves have the class labels assigned to each one of those resulting convex 
regions. 

The cost of using a conventional NN approach is high, since it requires
calculating distances from a point to all prototypes, while the cost of using the DT
is expected to be lower: in fact, one only needs to compare a point with a hyperplane
at most m times, m being the maximum depth of the tree. Moreover, note that this
technique may still be applied in conjunction with condensing schemes in order
to obtain an even more computationally efficient NN classifier.

Future work includes the investigation of stopping criteria that yield
a more efficient DT design algorithm without an important degradation
in the classification performance. It is also still necessary to
compare this approach with existing methods, such as kd trees. Finally, it would
be interesting to perform a theoretical analysis of the algorithm.

Abstract. This paper develops a Bayesian mixture model approach to
discrimination. The specific problem considered is the classification of 
mobile targets, from Inverse Synthetic Aperture Radar images. Howe- 
ver, the algorithm developed is relevant to the generic classification pro- 
blem. We model the data measurements from each target as a mixture 
distribution. A Bayesian formalism is adopted, and we obtain posterior 
distributions for the parameters of our mixture models. The distributi- 
ons obtained are too complicated for direct analytical use in a classifier, 
so a Markov chain Monte Carlo (MCMC) algorithm is used to provide 
samples from the distributions. These samples are then used to make 
classifications of future data. 

Keywords. Bayesian inference, Discrimination, Inverse Synthetic Aperture
Radar, Markov chain Monte Carlo, Mixture models, Target recognition.



1 Introduction 

This paper describes a Bayesian mixture model approach to discrimination [11].
The generic discrimination problem that we consider is one where we are given a 
set of training data consisting of class labelled measurements (and possibly some 
unlabelled measurements), and then want to assign a previously unseen object 
to one of the classes, on the basis of the measurements made of that object. 
Specifically, in this paper, we illustrate the approach by considering automatic 
target recognition (ATR) of Inverse Synthetic Aperture Radar (ISAR) images 
from 3 main classes of mobile targets. 

The mixture model approach to ATR aims to initially provide, for measurement
data $x$ and classes $j$, estimates of the class-conditional probability densities
of the data, $p(x|j)$. We can then produce estimates for the posterior probabilities
of class membership, $p(j|x)$, using Bayes’ theorem, $p(j|x) \propto p(x|j)\,p(j)$, where
$p(j)$ are the prior class probabilities.
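
As a small numerical illustration, the posterior probabilities follow from the class-conditional densities and the priors by normalization. The function name, the class names and the numbers below are made up for the example.

```python
def posteriors(likelihoods, priors):
    """Posterior class probabilities from class-conditional densities and priors.

    likelihoods[j] is the density of the measurement under class j,
    priors[j] is the prior probability of class j.
    """
    joint = {j: likelihoods[j] * priors[j] for j in priors}
    total = sum(joint.values())
    return {j: value / total for j, value in joint.items()}


# Hypothetical density values for one measurement under two classes.
post = posteriors({"class_1": 0.2, "class_2": 0.1}, {"class_1": 0.5, "class_2": 0.5})
```

The normalizing constant is the sum of the products over all classes, which is why only proportionality is needed in Bayes’ theorem.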

Estimating the posterior probabilities of class membership offers a number of 
advantages over producing class membership decisions only. These advantages 
include giving a measure of confidence for our class predictions, and the ready 
ability to combine the probabilities with additional information, such as intel- 
ligence reports. Furthermore, since after classifying a target we need to decide 
upon a course of action, we can incorporate the probabilities into a multilevel 
model that reflects the whole decision making process. For instance, by
considering the expected posterior loss of decisions, we can take into account the
different costs involved in making a classification.
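
A minimal sketch of a decision based on expected posterior loss is given below. The loss table, the class names and the function name are hypothetical and only illustrate how unequal costs can change the decision.

```python
def min_expected_loss(posterior, loss):
    """Choose the decision with the smallest expected posterior loss.

    posterior[j] is the posterior probability of class j, and
    loss[a][j] is the cost of taking decision a when the true class is j.
    """
    expected = {
        action: sum(costs[j] * posterior[j] for j in posterior)
        for action, costs in loss.items()
    }
    best = min(expected, key=expected.get)
    return best, expected


posterior = {"class_1": 0.6, "class_2": 0.4}
loss = {
    "decide_1": {"class_1": 0.0, "class_2": 10.0},  # mistaking class 2 for class 1 is costly
    "decide_2": {"class_1": 1.0, "class_2": 0.0},
}
decision, expected = min_expected_loss(posterior, loss)
```

Although class 1 is the more probable class here, the asymmetric costs make the second decision the one with the lower expected loss.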

Estimation of the class-conditional probability densities, $p(x|j)$, is complicated
by the high dimensionality of our radar target data. Non-parametric methods
of density estimation, such as kernel-based methods, would require unrealistically
large amounts of training data for accurate density estimates, while
parametric methods, such as a simple Gaussian classifier, might impose a specific
form on the density that is too rigid for the problem. Mixture models provide
a compromise between the two, attempting to provide enough flexibility
to accurately estimate the densities while imposing enough structure that we
can train with realistic amounts of data.

Further motivations for the mixture model approach in this application
arise from the following observations:

1. The probability density function of the radar returns for a single target can
be expressed as an integral over the angle of illumination of a conditional
density of a simple form (e.g. Gaussian, gamma), with the mixture distribution
arising as the approximation of this integral by a finite sum.

2. Additional effects, such as robustness to offsets in position, and amplitude
scaling, can be readily incorporated into such a model.

Previous work uses the Expectation-Maximisation (EM) algorithm to
estimate the parameters of class distributions that are gamma mixture models.
Hastie and Tibshirani use an EM algorithm on class distributions that are
Gaussian mixture models, making the assumption of a common covariance matrix
across all the mixture components. Laskey formulates a Bayesian approach
to modelling classes as mixtures of Gaussian distributions, but uses the
EM algorithm to estimate the maximum a posteriori parameter values only. The
approach presented here is a generalisation of the work of Lavine and West,
who look at a Bayesian approach to classification, where each class is distributed
as a single multivariate Gaussian.

2 The Bayesian Mixture Model Approach 

2.1 Introduction and Notation 

We consider classification of an object into one of J distinct classes, on the basis
of a d-dimensional data measurement of that object. The probability density
function for the d-dimensional data, $x$, can be written as:

$$p(x) = \sum_{j=1}^{J} \theta_j \, p(x|j) \qquad (1)$$

where $\theta = (\theta_1, \ldots, \theta_J)$ is a vector of the prior classification probabilities for each
class, with components satisfying $\sum_{j=1}^{J} \theta_j = 1$, and $p(x|j)$ is the class-conditional
probability density for data from class $j$.
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The class-conditional densities are also modelled by mixture distributions, 
with the j-th class having Rj components, which we refer to as subclasses: 

Rj 

p{x\j) = ■ ( 2 ) 

r— 1 

TTj = (tTj 1 . . . , '^j,Rj ) represents the prior subclass probabilities within class j] 
i.e. TTj is the mixing probability for the r-th subclass of the j-th class, satisfying 

D . 

Y^r=i = 1- We denote the complete set by tt = {nj, I < j < J}. 

The distribution p(x\j, r) represents the probability density of the data within 
a given subclass r, of a class j. We make the initial assumption that we have 
independence between the components of the data vector x, conditioned on the 
class and subclass. For ISAR images this corresponds to making an assumption 
of independence between the Radar Cross-Section fluctuations of any pair of 
pixels |S|. Note that this independence assumption for each component does not 
extend to an independence assumption for the mixture distribution as a whole. 
We take Gaussian forms for these distributions, with means $\mu_{j,r,l}$ and variances
$\sigma^2_{j,r,l}$, where $l = 1, \ldots, d$. For a given class and subclass these are represented by
the vectors $\mu_{j,r}$ and $\Sigma_{j,r}$. The sets of all means and variances are represented
by $\mu$ and $\Sigma$, respectively.
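
The two-level structure of (1) and (2), together with the independence assumption within each subclass, can be sketched in a few lines. This is only an illustration: the function names and the nested-list layout of the parameters are assumptions made here, not part of the original work.

```python
import numpy as np

def log_normal_diag(x, mean, var):
    """Log density of d independent Gaussians (diagonal covariance)."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def class_conditional(x, pi_j, means_j, vars_j):
    """p(x | j): mixture over the subclasses of one class, as in (2)."""
    return sum(p * np.exp(log_normal_diag(x, m, v))
               for p, m, v in zip(pi_j, means_j, vars_j))

def mixture_density(x, theta, pi, means, variances):
    """p(x): mixture over classes of the class-conditional densities, as in (1)."""
    return sum(t * class_conditional(x, pi[j], means[j], variances[j])
               for j, t in enumerate(theta))
```

Here `pi[j][r]`, `means[j][r]` and `variances[j][r]` hold the mixing probability, mean vector and variance vector of subclass r of class j.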

We have $n$ observed d-dimensional independent training data samples $y =
\{y_1, \ldots, y_n\}$, and introduce two types of classification variable for these data. The
overall class allocation variables are denoted $Z = (Z_1, \ldots, Z_n)$, and the subclass
classification variables, $z = (z_1, \ldots, z_n)$. These are such that $(Z_i = j, z_i = r)$
implies that the observation indexed by $i$ is modelled to be drawn from subclass
$r$ of class $j$. $Z_i$ is known for our labelled training data, but unknown otherwise.
$z_i$ will always be unknown, and is physically unimportant.

2.2 Our Approach 

The Bayesian approach to estimating the parameters of mixture models offers a 
number of advantages over methods based on maximum likelihood, such as the 
EM algorithm. Not least is the elimination of the problem of unboundedness of 
the likelihood function, that is frequently ignored in maximum likelihood tech- 
niques. There are also the standard arguments in favour of Bayesian techniques, 
such as the ability to cope with additional prior information, perhaps elicited 
from expert knowledge, and the production of confidence intervals for the para- 
meters estimated. There is also the potential for using hyper-parameters in our 
prior distributions for the mixture model parameters, to account for differences 
between training and test data, such as variations in the vehicle fit, or different 
types of vehicle from the same generic class. 

In the work presented here, the number of components to use in each class
mixture distribution has still to be addressed in a fully Bayesian manner. At
the moment we hold the number of subclasses fixed throughout the algorithm.
Reversible jump Markov chain Monte Carlo techniques'^] would be extremely 
complicated due to the high-dimensionality of our data. 






2.3 Model Details 



Prior distributions. The complete prior distribution for the mixture model
parameters is:

    p(\mu, \Sigma, \theta, \pi) = p(\mu, \Sigma) \, p(\theta) \prod_{j=1}^{J} p(\pi_j) .    (3)

As well as making an assumption of independence between the components
that make up the vectors $\mu_{j,r}$ and $\Sigma_{j,r}$, we also make the assumption that
$(\mu_{j,r}, \Sigma_{j,r})$ are mutually independent over all classes and subclasses. The
components $\mu_{j,r,l}$ and $\sigma^2_{j,r,l}$ are given independent normal-inverse gamma priors:

    \mu_{j,r,l} \mid \sigma^2_{j,r,l} \sim N(m_{j,r,l,0}, \, \sigma^2_{j,r,l} / h_{j,r,l,0}) , \quad
    \sigma^2_{j,r,l} \sim 1/Ga(\nu_{j,r,l,0}, V_{j,r,l,0}) ,    (4)

for fixed means $m_{j,r,l,0}$, precision parameters $h_{j,r,l,0}$, degrees of freedom $\nu_{j,r,l,0}$
and scale parameters $V_{j,r,l,0}$. The inverse gamma distribution is parameterised
so that the expectation is $V_{j,r,l,0} / (\nu_{j,r,l,0} - 1)$.

The values of these hyper-parameters are partially chosen with the aid of the 
training data, and for our specific application also the known angles of illumina- 
tion for the labelled data, giving a combination of priors from expert knowledge, 
and data dependent priors. 

We further assume independence of the priors for $\theta$ and $\pi_j$, $j = 1, \ldots, J$. For
both cases we take Dirichlet priors, with $\theta \sim D(a_0)$, where $a_0 = (a_{1,0}, \ldots, a_{J,0})$,
and $\pi_j \sim D(b_{j,0})$, where $b_{j,0} = (b_{j,1,0}, \ldots, b_{j,R_j,0})$. The hyper-parameters $a_0$ and
$b_{j,0}$ are held fixed.



The posterior distribution. The likelihood function for the problem is written:

    p(y \mid \mu, \Sigma, \theta, \pi) = \prod_{i=1}^{n} \left( \sum_{j=1}^{J} \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \prod_{l=1}^{d} N(y_{i,l}; \mu_{j,r,l}, \sigma^2_{j,r,l}) \right) .    (5)

Bayes' rule gives the following relationship between the posterior, prior and
likelihood:

    p(\mu, \Sigma, \theta, \pi \mid y) \propto p(y \mid \mu, \Sigma, \theta, \pi) \, p(\mu, \Sigma, \theta, \pi) ,    (6)

which, due to the multiplication of summations in the likelihood function, gives
a posterior distribution on which exact analytical inference cannot be made.

In particular, calculation of the normalisation constant is computationally 
infeasible, as are calculations of various statistics of interest, such as the means 
and variances of the parameters. To maintain a full Bayesian approach to the 
problem, we propose to draw samples from the posterior distribution. Since we 
cannot sample directly from the distribution, we use a Markov chain Monte 
Carlo (MCMC) algorithm [?], known as a Gibbs sampler [2].

3 MCMC Algorithm

3.1 General 

The Gibbs sampler is a means of sampling from a distribution that is too compli- 
cated for direct sampling, but which is such that we can split the variables into 
conditional distributions, each of which can be sampled from. To make use of the 
Gibbs sampler in our problem, we extend our set of random variables $(\mu, \Sigma, \pi, \theta)$
to include the allocation variables $(Z, z)$, and sample from the posterior distribution
$p(\mu, \Sigma, \pi, \theta, Z, z \mid y)$. To do this we divide into three distinct groupings,
$(\mu, \Sigma)$, $(\pi, \theta)$ and $(Z, z)$, obtaining posterior probabilities for each group that
are conditional on the other two groups.

3.2 Conditional Distributions 

The mixture components. Given the allocation variables $(Z, z)$, we can make
use of the fact that the data $y$ consist of classified independent samples from the
$k = R_1 + \cdots + R_J$ subclasses, giving:

    p(\mu, \Sigma \mid y, \theta, \pi, Z, z) = p(\mu, \Sigma \mid y, Z, z) .    (7)

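We define $G_{j,r} = \{i : (Z_i = j, z_i = r)\}$, the set of indices of data elements
that have been assigned to subclass $r$ of class $j$, and $g_{j,r}$ to be the cardinality of
$G_{j,r}$. We also define the subclass sample means
$\bar{y}_{j,r,l} = \frac{1}{g_{j,r}} \sum_{i \in G_{j,r}} y_{i,l}$ and make use of the sums of squared
deviations $\sum_{i \in G_{j,r}} (y_{i,l} - \bar{y}_{j,r,l})^2$.

Our independent normal-inverse gamma priors then give rise to independent
normal-inverse gamma posterior distributions

    \mu_{j,r,l} \mid (\sigma^2_{j,r,l}, y, Z, z) \sim N(m_{j,r,l}, \, \sigma^2_{j,r,l} / h_{j,r,l}) ,    (8)

and:

    \sigma^2_{j,r,l} \mid (y, Z, z) \sim 1/Ga(\nu_{j,r,l}, V_{j,r,l}) ,    (9)

where:

    h_{j,r,l} = h_{j,r,l,0} + g_{j,r} ,
    m_{j,r,l} = (h_{j,r,l,0} m_{j,r,l,0} + g_{j,r} \bar{y}_{j,r,l}) / h_{j,r,l} ,
    \nu_{j,r,l} = \nu_{j,r,l,0} + g_{j,r} / 2 ,
    V_{j,r,l} = V_{j,r,l,0} + \frac{1}{2} \sum_{i \in G_{j,r}} (y_{i,l} - \bar{y}_{j,r,l})^2
                + \frac{h_{j,r,l,0} \, g_{j,r}}{2 h_{j,r,l}} (\bar{y}_{j,r,l} - m_{j,r,l,0})^2 .    (10)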

The allocation probabilities. Given the allocation variables $(Z, z)$, the class
and subclass allocation probabilities, $(\theta, \pi)$, will be independent of $(y, \mu, \Sigma)$.
Thus we have:

    p(\theta, \pi \mid y, \mu, \Sigma, Z, z) = p(\theta \mid Z) \, p(\pi \mid Z, z) .    (11)

For the class allocation probabilities, defining $g_j = \sum_{r=1}^{R_j} g_{j,r}$, we have:

    \theta \mid Z \sim D(a) , where a = (a_1, \ldots, a_J) , with a_j = g_j + a_{j,0} .    (12)

For the subclass allocation probabilities we obtain the following independent
distributions, for $j = 1, \ldots, J$:

    \pi_j \mid (Z, z) \sim D(b_j) , where b_j = (b_{j,1}, \ldots, b_{j,R_j}) , with b_{j,r} = g_{j,r} + b_{j,r,0} .    (13)

The allocation variables. Since the data $y_i$ are conditionally independent
given $(\mu, \Sigma, \theta, \pi)$, the pairs of allocation variables $(Z_i, z_i)$ are conditionally
independent given $(y, \mu, \Sigma, \theta, \pi)$, so:

    p(Z, z \mid y, \mu, \Sigma, \theta, \pi) = \prod_{i=1}^{n} p(Z_i, z_i \mid y_i, \mu, \Sigma, \theta, \pi) .    (14)

If $Z_i$ is unknown for the i-th data vector, we have:

    p(Z_i = j \mid y_i, \mu, \Sigma, \theta, \pi) \propto \theta_j \sum_{r=1}^{R_j} \pi_{j,r} \, p(y_i \mid \mu_{j,r}, \Sigma_{j,r}) .    (15)

For the subclass allocation variable $z_i$, we have:

    p(z_i = r \mid Z_i = j, y_i, \mu, \Sigma, \theta, \pi) \propto \pi_{j,r} \, p(y_i \mid \mu_{j,r}, \Sigma_{j,r}) .    (16)

3.3 Algorithm Specifics 

To start our algorithm, we take initial allocation vectors, $Z^{(0)} = (Z_1^{(0)}, \ldots, Z_n^{(0)})$,
and $z^{(0)} = (z_1^{(0)}, \ldots, z_n^{(0)})$. Some or all of the elements of vector $Z$ are actually
known, from our labelled training data, in which case we use these known values.
We describe the algorithm in terms of the t-th iteration, which updates the set of
parameters and allocation variables
$(\mu^{(t-1)}, \Sigma^{(t-1)}, \theta^{(t-1)}, \pi^{(t-1)}, Z^{(t-1)}, z^{(t-1)})$
from the end of the (t-1)-th iteration, to $(\mu^{(t)}, \Sigma^{(t)}, \theta^{(t)}, \pi^{(t)}, Z^{(t)}, z^{(t)})$:

1. Draw a sample $(\mu^{(t)}, \Sigma^{(t)})$ from $p(\mu, \Sigma \mid y, Z^{(t-1)}, z^{(t-1)})$, using (8) and (9).
   This gives an updated set of parameters for the subclass distributions.

2. Sample $(\theta^{(t)}, \pi^{(t)})$ from $p(\theta, \pi \mid y, Z^{(t-1)}, z^{(t-1)})$, using (12) and (13).
   This gives an updated set of the class and subclass allocation probabilities.

3. Sample $(Z^{(t)}, z^{(t)})$ from $p(Z, z \mid y, \mu^{(t)}, \Sigma^{(t)}, \theta^{(t)}, \pi^{(t)})$, using (15) and (16),
   giving an updated set of class and subclass allocation variables. Note that for
   our class labelled data, we set the class allocation variables to their known
   values, and only re-estimate the corresponding subclass allocation variables.
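
The three steps above can be sketched as a single function. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name, the flat component indexing, and the use of one scalar set of hyper-parameters for every component are choices made here for brevity, and the scale update used for the inverse gamma draw is the standard conjugate result for a normal-inverse gamma prior.

```python
import numpy as np

def gibbs_sweep(y, Z, z, known, R, hyper, rng):
    """One Gibbs iteration. y: (n, d) data; Z, z: integer class and subclass
    allocations (updated in place); known: boolean mask of labelled samples;
    R: number of subclasses per class; hyper: (m0, h0, nu0, V0, a0, b0),
    shared by all components here (an assumption of this sketch)."""
    n, d = y.shape
    m0, h0, nu0, V0, a0, b0 = hyper
    comps = [(j, r) for j in range(len(R)) for r in range(R[j])]
    cls = np.array([j for j, _ in comps])
    K = len(comps)
    g, mu, var = np.zeros(K), np.zeros((K, d)), np.ones((K, d))

    # Step 1: component means and variances given the allocations.
    for k, (j, r) in enumerate(comps):
        yk = y[(Z == j) & (z == r)]
        g[k] = len(yk)
        ybar = yk.mean(axis=0) if len(yk) else np.zeros(d)
        ss = ((yk - ybar) ** 2).sum(axis=0)
        h = h0 + g[k]
        m = (h0 * m0 + g[k] * ybar) / h
        nu = nu0 + 0.5 * g[k]
        # Standard conjugate update of the scale parameter (assumed form).
        V = V0 + 0.5 * ss + 0.5 * h0 * g[k] / h * (ybar - m0) ** 2
        var[k] = V / rng.gamma(nu, 1.0, size=d)      # inverse gamma draw
        mu[k] = rng.normal(m, np.sqrt(var[k] / h))   # normal draw given the variance

    # Step 2: class and subclass probabilities from their Dirichlet posteriors.
    theta = rng.dirichlet(a0 + np.bincount(cls, weights=g, minlength=len(R)))
    pi = np.zeros(K)
    for j in range(len(R)):
        pi[cls == j] = rng.dirichlet(b0 + g[cls == j])

    # Step 3: allocations. Drawing (Z_i, z_i) jointly with probability
    # proportional to theta_j * pi_{j,r} * p(y_i | mu_{j,r}, Sigma_{j,r}) is
    # equivalent to drawing Z_i first and then z_i given Z_i.
    loglik = -0.5 * (np.log(2 * np.pi * var)[None]
                     + (y[:, None, :] - mu[None]) ** 2 / var[None]).sum(axis=2)
    logw = np.log(theta)[cls] + np.log(pi) + loglik
    for i in range(n):
        lw = logw[i].copy()
        if known[i]:
            lw[cls != Z[i]] = -np.inf                # keep the known class label
        p = np.exp(lw - lw.max())
        k = rng.choice(K, p=p / p.sum())
        Z[i], z[i] = comps[k]
    return mu, var, theta, pi
```

Calling `gibbs_sweep` repeatedly with `rng = np.random.default_rng(seed)` and storing the returned tuples yields the chain of samples; `Z` and `z` must be integer NumPy arrays because they are modified in place.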

After an initial burn-in period, during which the generated Markov chain reaches
equilibrium, the set of parameters $(\mu^{(t)}, \Sigma^{(t)}, \theta^{(t)}, \pi^{(t)}, Z^{(t)}, z^{(t)})$ can be
regarded as dependent samples from the posterior distribution $p(\mu, \Sigma, \theta, \pi, Z, z \mid y)$.
To obtain approximately independent samples we leave a gap, known as the
decorrelation gap, between successive samples (i.e. we only retain a sample every
l-th iteration of the algorithm, where $l$ is an integer greater than one). If we
are only concerned with ergodic averages, we actually obtain better variances 
if we do not sub-sample the output of our Markov chain. However, if storage 
of the samples is an issue, we may like to leave a decorrelation gap, so that we 
can be sure to explore the full space of the distribution, without having to keep 
thousands of samples.

3.4 Classification of the Observations

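The training data. We now obtain formulae for the posterior classification
probabilities of the training data (i.e. already observed measurements, $y$),
remembering that some of this training data may be unlabelled in class. For
notational ease we let $D$ denote the combination of our data measurements $y$, and
any known class allocations for this data.

Rather than approximating the posterior distributions directly, as follows:

    p(Z_i = j \mid D) \approx \frac{1}{N} \sum_{m=1}^{N} I(Z_i^{(m)} = j) ,    (17)

where $N$ is the number of MCMC samples, we use Rao-Blackwellisation [?] to
provide more efficient estimates, by using an approximation to the posterior
marginalised distribution $p(Z \mid D)$, based on our MCMC sampled values:

    p(Z_i = j \mid D) \approx \frac{1}{N} \sum_{m=1}^{N} p(Z_i = j \mid y_i, \mu^{(m)}, \Sigma^{(m)}, \theta^{(m)}, \pi^{(m)}) ,    (18)

where using (15) we have:

    p(Z_i = j \mid y_i, \mu^{(m)}, \Sigma^{(m)}, \theta^{(m)}, \pi^{(m)})
        = \frac{\theta_j^{(m)} \sum_{r=1}^{R_j} \pi_{j,r}^{(m)} \, p(y_i \mid \mu_{j,r}^{(m)}, \Sigma_{j,r}^{(m)})}
               {\sum_{j'=1}^{J} \theta_{j'}^{(m)} \sum_{r=1}^{R_{j'}} \pi_{j',r}^{(m)} \, p(y_i \mid \mu_{j',r}^{(m)}, \Sigma_{j',r}^{(m)})} .    (19)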
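
As a rough illustration of the Rao-Blackwellised estimate, the following sketch averages normalised class probabilities over a list of stored parameter draws instead of counting sampled labels. The function names and the layout of each sample, a tuple `(theta, pi, mu, var)` with `pi[j]` of length R_j and `mu[j]`, `var[j]` of shape (R_j, d), are assumptions made for this example.

```python
import numpy as np

def class_probabilities(y_i, theta, pi, mu, var):
    """Normalised class probabilities of one observation for one parameter draw."""
    logp = np.empty(len(theta))
    for j in range(len(theta)):
        m, v = np.asarray(mu[j], float), np.asarray(var[j], float)
        # Log of pi_{j,r} times the diagonal Gaussian density, for every subclass r.
        comp = np.log(pi[j]) - 0.5 * np.sum(
            np.log(2 * np.pi * v) + (y_i - m) ** 2 / v, axis=1)
        logp[j] = np.log(theta[j]) + np.logaddexp.reduce(comp)
    p = np.exp(logp - logp.max())
    return p / p.sum()

def rao_blackwell_estimate(y_i, samples):
    """Average of the per-draw class probabilities over the MCMC samples."""
    return np.mean([class_probabilities(y_i, *s) for s in samples], axis=0)
```

Because each term is a full probability vector rather than a 0/1 indicator, the average typically has lower variance than the direct estimate for the same number of samples.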



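Future observations. We now consider classifying a previously unseen observation,
$y_f$, by looking at the posterior probabilities:

    p(Z_f = j \mid D, y_f) \propto p(Z_f = j \mid D) \, p(y_f \mid D, Z_f = j) ,    (20)

where:

    p(Z_f = j \mid D) = E(\theta_j \mid D) \approx \frac{1}{N} \sum_{s=1}^{N} E(\theta_j \mid D, Z^{(s)}, z^{(s)}) ,    (21)

and:

    p(y_f \mid D, Z_f = j) \approx \frac{1}{N} \sum_{s=1}^{N} p(y_f \mid D, Z_f = j, Z^{(s)}, z^{(s)}) .    (22)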

If we have reason to believe that the spread of future data between the different
classes is likely to be different to that in the training data, we can replace
the expressions for $p(Z_f = j \mid D)$ in (21) with our modified beliefs. We write
$p(y_f \mid D, Z_f = j, Z^{(s)}, z^{(s)})$ as a mixture distribution:

    p(y_f \mid D, Z_f = j, Z^{(s)}, z^{(s)}) = \sum_{r=1}^{R_j} \{ p(z_f = r \mid D, Z_f = j, Z^{(s)}, z^{(s)})
                                               \times p(y_f \mid D, Z_f = j, z_f = r, Z^{(s)}, z^{(s)}) \} ,    (23)

where:

    p(z_f = r \mid D, Z_f = j, Z^{(s)}, z^{(s)}) = E(\pi_{j,r} \mid D, Z^{(s)}, z^{(s)}) ,    (24)

and $p(y_f \mid D, Z_f = j, z_f = r, Z^{(s)}, z^{(s)})$ is the predictive density, for data drawn
from subclass $r$ of class $j$, using component distributions determined by the
MCMC sample outputs, $(Z^{(s)}, z^{(s)})$. Some calculations [?] show that this predictive
density is given by a product of independent Student-t distributions,
with the l-th component having $2\nu_{j,r,l}$ degrees of freedom, location parameter
$m_{j,r,l}$ and scale parameter $(V_{j,r,l}(h_{j,r,l} + 1) / (\nu_{j,r,l} h_{j,r,l}))^{1/2}$, the parameters being
defined in (10) using allocation variables $(Z^{(s)}, z^{(s)})$.
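
A small sketch of the subclass predictive density may help here. It assumes the parameterisation used in this section: inverse gamma shape $\nu$ and scale $V$, with the Student-t scale taken as $\sqrt{V(h+1)/(\nu h)}$, which is the standard result for a normal-inverse gamma posterior. The function names are illustrative only.

```python
import numpy as np
from math import lgamma

def log_student_t(x, dof, loc, scale):
    """Log density of a univariate Student-t with the given dof, location and scale."""
    u = (x - loc) / scale
    return (lgamma(0.5 * (dof + 1.0)) - lgamma(0.5 * dof)
            - 0.5 * np.log(dof * np.pi) - np.log(scale)
            - 0.5 * (dof + 1.0) * np.log1p(u * u / dof))

def log_predictive(y_f, m, h, nu, V):
    """Log predictive density of one subclass: a sum over the d independent
    components, each a Student-t with 2*nu degrees of freedom."""
    y_f, m, h, nu, V = (np.asarray(a, dtype=float) for a in (y_f, m, h, nu, V))
    scale = np.sqrt(V * (h + 1.0) / (nu * h))
    return float(sum(log_student_t(y, 2.0 * n_, m_, s_)
                     for y, n_, m_, s_ in zip(y_f, nu, m, scale)))
```

Exponentiating `log_predictive` for each subclass and weighting by the expected mixing probabilities gives the mixture in (23).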

4 Experimental Results 

We illustrate the use of the algorithm on real ISAR data (see Fig. 1), consisting
of images of vehicles from 3 main types of battlefield target, which we denote 
by classes 1, 2 and 3. Our training data consist of approximately equal amounts 
of images from each of the 3 classes (about 2000 per class), collected over single 
complete rotations of the vehicles, at a constant depression angle. Our test data 
consist of 6 sets of approximately 400 ISAR images, collected from single com- 
plete rotations of 6 vehicles. Of these, datasets B-hd and B-er are the vehicle 
from dataset B imaged at a higher depression angle and with the engines running 
respectively, while the remaining sets correspond to different vehicles within the 
same generic class type. Unfortunately, we do not have an independent set of 
test data for class 2, negating the possibility of obtaining a meaningful single 
measure of performance from our test data. 

Our ISAR data is, after some initial pre-processing, 38 pixels in range by 
26 pixels in cross-range, giving an overall dimensionality of 988. To reduce this 
down to a slightly more manageable level, a principal components analysis ^1] 
has been conducted on the data, and we actually use only the first 35 linear 
principal components of each image vector. The algorithm has been run with 
12 subclasses per class, to draw 1000 samples with a decorrelation gap of 10 
iterations, after a burn-in period of 10000 iterations. 




Fig. 1. Typical vehicle and ISAR image from our data set.

Table 1. Classification rates (%) for training datasets.

  Data set   True class   Predicted class
                             1       2       3
  B              1         99.1     0.6     0.3
  A              2          0.6    98.2     1.1
  C              3          0.0     1.1    98.8

Table 2. Classification rates (%) for test datasets.

  Data set   True class   Predicted class
                             1       2       3
  B-hd           1         62.6    22.7    14.8
  B-er           1         98.6     1.2     0.2
  D              1         71.1    12.5    16.5
  E              1         83.6     1.7    14.7
  F              3          7.9    55.2    36.8
  G              3         37.6    26.9    35.5


Table 1 documents the classification rates for the training data, and Table 2
those for the test data. In both cases the classifications have been made by taking
the class which gives the largest posterior probability. A full set of results for
our algorithm, including a treatment of uncertainty in position of the vehicle
in the image, along with a comparison with other classification techniques [14],
is given in [?]. The limited sets of results given here show that we have been
able to train the classifier well on the training data (greater than 98% classifier
accuracy), and show the performance extending well to test data when the
vehicle and imaging conditions are similar. However, the extension to classifying
different vehicles to the same generic class proves problematical, as illustrated
by data sets F and G in particular. In part this is due to there sometimes being
more similarity between two vehicles from different classes than there is between
two vehicles within the same class. The comparisons in [?] show the Bayesian
mixture model technique to compare favourably with other classifiers, including
a mixture model classifier based on the EM-algorithm.

It should be noted, however, that classification rate does not give a true 
indication of the overall performance of a classifier [5]. Not least is the fact that
it treats all misclassifications with equal weight. Where we have overlap between 
classes, we will never be able to obtain perfect classification, and a good classifier 
would indicate uncertainty between the classes for data in the overlapping region. 
Thus rather than a classification rate, an assessment of the accuracy of our 
estimates of the posterior probabilities is desirable. However, such an assessment 
is extremely difficult for real data. As well as class overlap there is also the issue 
of the possibly different misclassification costs involved [?].

5 Summary and Discussion 

In this paper we have developed a Bayesian mixture model approach to discrimi- 
nation. We have modelled the class-conditional densities as Gaussian mixtures, 
and conducted a Bayesian analysis under the assumption of constant model or- 
der of the mixtures. The use of the algorithm has been demonstrated on real 
datasets, consisting of ISAR images of mobile targets. On this data we have 
attempted the difficult task of classifying vehicles into generic classes, rather
than specific examples of a class. Future work will need to address issues such
as whether this is actually feasible, the number of components per mixture (a 
model selection problem), alternative methods for assessing the performance, 
and the use of different component distributions in our mixture models, such as 
gamma distributions. However, this work has established that the technique is 
a viable and sound method for discrimination when test conditions are similar 
(but not necessarily identical) to the training conditions.

Abstract. When a Bayesian classifier is designed, a model for the class
probability density functions (PDFs) has to be chosen. This choice is
determined by a trade-off between robustness and low complexity (which
is usually satisfied by simple parametric models, based on a restricted
number of parameters) and the model's ability to fit a large class of
PDFs (which usually requires a high number of model parameters).

In this paper, a model is introduced, where the class PDFs are appro- 
ximated as piecewise multi-linear functions (a generalisation of bilinear 
functions for an arbitrary dimensionality). This model is compared with 
classical parametric and non-parametric models, from a point of view of 
versatility, robustness and complexity. The results of classification and 
PDF estimation experiments are discussed. 



1 Introduction 

Bayesian pattern classification consists of selecting, for a given pattern $\omega$, the
class $\Omega_t$ which maximises $d_t(x) = P(\omega \in \Omega_t) f(x \mid \omega \in \Omega_t)$, where $x$ is the feature
vector observed for the pattern. Besides the knowledge of the class prior
probabilities $P(\omega \in \Omega_t)$, the evaluation of the right hand side expression requires the
class probability density functions (PDFs) $f(x \mid \omega \in \Omega_t)$. These class probability
densities are estimated from a training set.

Examples of classical techniques are linear and quadratic discriminant analysis,
which are based on modelling the functions $f(x \mid \omega \in \Omega_t)$ by PDFs satisfying
the normal distribution (e.g. [?]). The parameters (the expectations of the
features and their variances and covariances) are estimated from the training
set data. The discriminant functions $d_t$ obtained this way (after simplification)
are linear or quadratic as suggested. For a large class of distributions, the linear
and quadratic discriminant analysis techniques are quite robust to departures
from the normal distribution model. For some data distributions however,
these techniques do not yield satisfactory classifiers (especially when dealing with
multi-modal distributions). To some extent, these problems can be dealt with
by considering the estimated density as a mixture of two or more probability
densities of a given model, like the approach of [?], involving normal distributions.
Independence from model constraints can be achieved by using non-parametric
density estimation. One classical non-parametric approach is based
on an approximation $\hat{f}_t$ of the probability densities using weighted sums of chosen
orthonormal base functions (e.g. [2]):



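    \hat{f}_t(x) = \sum_{j=1}^{m} a_{tj} \, \psi_j(x) ,    (1)

where $a_{tj}$ are a-priori unknown coefficients. This approach searches a solution
for these coefficients, minimising the criterion:

    C = \int_{x \in I} ( f(x \mid \omega \in \Omega_t) - \hat{f}_t(x) )^2 \, dx .    (2)

It can be proven that the optimal coefficients may be estimated as:

    \hat{a}_{tj} = \frac{1}{p_t} \sum_{\omega_i \in \Omega_t} \psi_j(x_i) ,    (3)

where $p_t$ is the number of learning set patterns of class $t$ and $x_i$ is the feature
vector of pattern $\omega_i$.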
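
The estimator in (3) is simply a sample mean of each base function. The sketch below shows this for a single feature, assuming the domain $I = [-1, 1]$ and orthonormalised Legendre polynomials as base functions; the function names and the choice of domain are assumptions of this example.

```python
import numpy as np
from numpy.polynomial import legendre

def orthonormal_legendre(x, m):
    """First m orthonormal Legendre functions on [-1, 1], evaluated at x.
    Returns an array of shape (len(x), m)."""
    norms = np.sqrt((2 * np.arange(m) + 1) / 2.0)
    return legendre.legvander(np.asarray(x, dtype=float), m - 1) * norms

def fit_coefficients(samples, m):
    """Coefficient estimates: the sample mean of each base function."""
    return orthonormal_legendre(samples, m).mean(axis=0)

def density(x, coeffs):
    """Estimated density: the weighted sum of base functions, as in (1)."""
    return orthonormal_legendre(x, len(coeffs)) @ coeffs
```

With `m = 11` this corresponds to polynomials up to degree 10. The first coefficient always equals $\sqrt{1/2}$, so the estimate integrates to one, but nothing prevents it from becoming negative in places.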

The advantage of the latter technique is its extreme simplicity. Practice shows 
however that the choice of the orthonormal base functions and the size of the 
expansion is not always an easy task. When m, the number of base functions in 
the expansion, is too small, a good representation of the real density function 
by the series expansion is not guaranteed; a high m-value requires however a 
large training set for a reliable estimation of the m coefficients; moreover, the 
complexity of the calculation of probability density values increases with m. In 
order to tackle the latter problem, we decided to consider an approach involving 
an extremely simple interpolation model: the approximated probability densities 
are modelled as piecewise multi-linear (PML) functions. We will show that this is 
equivalent to decomposing the PDFs with specific localised base functions. Other 
approaches have been reported involving orthogonal localised base functions like 
wavelets (e.g. [5]). By choosing the PML approach we opted for a model yielding
a relatively fast pattern classification, at the expense of a relatively slow classifier 
training phase, which is due to the fact that the chosen base functions are not 
orthogonal. 



2 Mathematical Background of the Proposed Method 

2.1 Problem Description 

The present paper proposes an alternative representation of approximated PDFs
$\hat{f}_t(x)$, defined in a bounded domain $I$. This domain is divided into cells on the
basis of a multidimensional rectangular point lattice. The probability densities
inside the cells are obtained by a multi-linear interpolation of probability density
values at the lattice points (i.e. inside a cell and along any line segment parallel
to one of the main axes of the coordinate system, values are obtained by linear
interpolation). The description of a function $\hat{f}_t(x)$ is reformulated as a weighted
sum of base functions, so that a reasoning can be applied, similar to the one
used for the decomposition in orthonormal base functions.

2.2 Definitions and Notations 

Before continuing the elaboration of the mathematics, we introduce here a few 
notations. 

- considering bounded values for the $n$ components of a feature vector $x$, its
components $x_i$ ($i \in \{1, \ldots, n\}$) satisfy: $x_{i,\min} \le x_i \le x_{i,\max}$; consequently, we
define $I$ as:

    x \in I = \prod_{i=1}^{n} [x_{i,\min}, x_{i,\max}] \subset \mathbb{R}^n ;

- each point in the lattice is characterised by $n$ indices $r_i$ ($r_i \in \{0, \ldots, m_i\}$);
to denote the indices of lattice point $(x_{1,r_1}, x_{2,r_2}, \ldots, x_{n,r_n})$, we will use a
vector $r = (r_1, \ldots, r_n)$ as index; thus, a lattice point will be denoted by $x_r$
from now on; the set of lattice point indices $R$ is defined as:

    r \in R = \prod_{i=1}^{n} \{0, \ldots, m_i\} \subset \mathbb{N}^n ;

- the following subsets of lattice point indices are also defined:

    R_0^+ = \{0, 1\}^n ;

    R_r^+ = \{ r' \mid r, r' \in R \wedge \exists r'' \in R_0^+ : r' = r + r'' \} ;

R,= U Rt \ 

    R^- = \{ r \mid r \in R \wedge \forall i \in \{1, \ldots, n\} : r_i < m_i \} ;

- the lattice determines (n - 1)-dimensional hyper-planes in $\mathbb{R}^n$, dividing $I$
into cells ($r \in R^-$):

    I_r = \{ x \mid x \in I, \forall i \in \{1, \ldots, n\} : x_{i,r_i} \le x_i \le x_{i,r_i+1} \} ;

2.3 Description of the Model 

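As mentioned, we introduce a piecewise multi-linear model for $\hat{f}_t(x)$:

    \forall r \in R^-, \forall x \in I_r : \hat{f}_t(x) = \sum_{r' \in R_r^+} \hat{f}_{t,r'} \prod_{i=1}^{n} (1 - q_i)^{\delta_{r_i r'_i}} \, q_i^{1 - \delta_{r_i r'_i}} ,    (4)

where

    \forall i \in \{1, \ldots, n\} : q_i = \frac{x_i - x_{i,r_i}}{x_{i,r_i+1} - x_{i,r_i}}    (5)
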

and where $\delta$ is Kronecker's symbol. The coefficients $\hat{f}_{t,r'}$ are the PDF values at
the corner points $r' \in R_r^+$ of cell $I_r$. This is illustrated in Fig. 1 for a
dimensionality of $n = 2$. Note that inside cell $I_r$ the function is linear in any variable $q_i$
for constant values of $q_j$ ($\forall j \neq i$), which means that inside a cell, the function
implements a linear interpolation along the directions of the main coordinate
axes.

Fig. 1. Piecewise bilinear probability density function model: inside cell $I_{(r_1,r_2)}$,
the values of the approximated PDF are found by applying bilinear interpolation:
$\hat{f}_t(x_1, x_2) = \hat{f}_{t,(r_1,r_2)} (1 - q_1)(1 - q_2) + \hat{f}_{t,(r_1,r_2+1)} (1 - q_1) q_2 + \hat{f}_{t,(r_1+1,r_2)} q_1 (1 - q_2)
+ \hat{f}_{t,(r_1+1,r_2+1)} q_1 q_2$; $q_1$ and $q_2$ satisfy (5) and take values ranging from 0 to 1.
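
The interpolation rule can be written down directly for any dimensionality. The sketch below is an illustration only: the function name and the way the lattice is stored, as a list of sorted coordinate arrays plus an n-dimensional array of lattice values, are assumptions of this example.

```python
import itertools
import numpy as np

def pml_value(x, grid, values):
    """Piecewise multi-linear interpolation at point x.
    grid: list of n sorted 1-D arrays with the lattice coordinates per axis;
    values: n-dimensional array of lattice point values."""
    r, q = [], []
    for xi, g in zip(np.asarray(x, dtype=float), grid):
        # Index of the cell containing xi along this axis, then the local coordinate.
        k = int(np.clip(np.searchsorted(g, xi, side="right") - 1, 0, len(g) - 2))
        r.append(k)
        q.append((xi - g[k]) / (g[k + 1] - g[k]))
    total = 0.0
    # Sum over the 2^n corner points of the cell.
    for corner in itertools.product((0, 1), repeat=len(grid)):
        w = 1.0
        for c, qi in zip(corner, q):
            w *= qi if c else (1.0 - qi)
        total += w * values[tuple(k + c for k, c in zip(r, corner))]
    return total
```

The cost of one evaluation grows as $2^n$, but it does not depend on the number of lattice points, which is what makes classification with this model fast.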

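Consider now a special set of “unit” piecewise multi-linear functions $\Psi_r(x)$,
defined on the basis of the same point lattice, and satisfying:

    \Psi_r(x_{r'}) = \prod_{i=1}^{n} \delta_{r_i r'_i} ,

meaning that the only point in the lattice where the value of the function is
non-zero (1, actually) is the point with index $r$. For these functions, (4) can be
rewritten as: $\forall r'' \in R^-, \forall x \in I_{r''}$ :

    \Psi_r(x) = \sum_{r' \in R_{r''}^+} \delta_{r r'} \prod_{i=1}^{n} (1 - q_i)^{\delta_{r''_i r'_i}} \, q_i^{1 - \delta_{r''_i r'_i}}
              = \begin{cases}
                  \prod_{i=1}^{n} (1 - q_i)^{\delta_{r''_i r_i}} \, q_i^{1 - \delta_{r''_i r_i}} & r \in R_{r''}^+ \\
                  0 & r \notin R_{r''}^+
                \end{cases}    (6)

where $\delta_{r r'} = \prod_{i=1}^{n} \delta_{r_i r'_i}$.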

Functions $\Psi_r(x)$ can be used as base functions for the decomposition of $\hat{f}_t(x)$:

    \hat{f}_t(x) = \sum_{r \in R} \hat{f}_{t,r} \, \Psi_r(x) .

This allows us to express the approximation criterion of (2) as a function of
unknown coefficients $\hat{f}_{t,r}$, by substituting $\hat{f}_t(x)$ by the right hand side of this
equation. Equating (half) the derivatives of $C$, with respect to these coefficients,
to zero, yields a system of equations: $\forall r \in R$ :

    \int_{x \in I} \left( f(x \mid \omega \in \Omega_t) - \sum_{r' \in R} \hat{f}_{t,r'} \Psi_{r'}(x) \right) \Psi_r(x) \, dx
        = -\frac{1}{2} \frac{\partial C}{\partial \hat{f}_{t,r}} = 0 .

Rearranging the terms yields: $\forall r \in R$ :

    \sum_{r' \in R} \hat{f}_{t,r'} \int_{x \in I} \Psi_{r'}(x) \Psi_r(x) \, dx = \int_{x \in I} f(x \mid \omega \in \Omega_t) \Psi_r(x) \, dx .    (7)



If the functions $\Psi_r$ were orthonormal, the integral in the left hand side of
(7) would be zero for $r' \neq r$ and 1 for $r' = r$, yielding a simple expression for
estimating coefficient $\hat{f}_{t,r}$, like in (3). Fortunately the base functions $\Psi_r$ are only
nonzero inside the cells immediately surrounding the lattice point $x_r$. The
only cells involved in the calculation of the integral in the left hand side of (7)
are cells $I_{r''}$ for which both $r$ and $r'$ belong to the corner points; in other words:
$r \in R_{r''}^+$ and $r' \in R_{r''}^+$. Note also that the right hand side of (7) is an expression
for the class conditional expectation $E\{\Psi_r(x) \mid \omega \in \Omega_t\}$. This expression can be
estimated from a training set as the mean value for the observations of class $\Omega_t$
of function $\Psi_r$ (like the result of (3)). On the basis of these considerations, an
estimator for the unknown lattice point potentials $\hat{f}_{t,r'}$ can be elaborated after
substituting (6) in (7) and using



    \int_0^1 q_i^2 \, dq_i = \int_0^1 (1 - q_i)^2 \, dq_i = 1/3 , \quad \int_0^1 q_i (1 - q_i) \, dq_i = 1/6 .    (8)



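One obtains: $\forall r \in R$ :

    \sum_{r' \in R} \hat{f}_{t,r'} \sum_{r'' \in R^- : \, r, r' \in R_{r''}^+} V(I_{r''}) \prod_{i=1}^{n} \left( \frac{1}{2} \right)^{1 - \delta_{r_i r'_i}} \left( \frac{1}{3} \right)
        = \frac{1}{p_t} \sum_{\omega_i \in \Omega_t} \Psi_r(x_i) ,    (9)

where $V(I_{r''})$ is the (hyper-)volume of cell $I_{r''}$, obtained when transforming the
variables $x_i$ into $q_i$, using (5), i.e.:

    V(I_{r''}) = \prod_{i=1}^{n} (x_{i,r''_i+1} - x_{i,r''_i}) .

Equations (9) constitute a set of $\#R$ linear equations in the $\#R$ unknown
estimated lattice point potentials $\hat{f}_{t,r'}$.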
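
For a single feature the linear system is tridiagonal and easy to write out: by (8), a cell of width $w$ contributes $w/3$ to the diagonal entries of its two corner points and $w/6$ to the off-diagonal entry coupling them. The sketch below builds this system and solves it by Gauss-Seidel iteration. It is only a one-dimensional illustration; the function names and the fixed number of sweeps are assumptions of this example.

```python
import numpy as np

def hat_functions(x, nodes):
    """Values of all unit piecewise-linear base functions at the points x.
    Returns an array of shape (len(x), len(nodes))."""
    x = np.asarray(x, dtype=float)
    k = np.clip(np.searchsorted(nodes, x, side="right") - 1, 0, len(nodes) - 2)
    q = (x - nodes[k]) / (nodes[k + 1] - nodes[k])
    psi = np.zeros((len(x), len(nodes)))
    rows = np.arange(len(x))
    psi[rows, k] = 1.0 - q
    psi[rows, k + 1] = q
    return psi

def fit_pml_1d(samples, nodes, sweeps=200):
    """Estimate the lattice point PDF values for one feature."""
    nodes = np.asarray(nodes, dtype=float)
    w = np.diff(nodes)
    m = len(nodes)
    M = np.zeros((m, m))
    for k in range(m - 1):              # contribution of cell k to its two corners
        M[k, k] += w[k] / 3.0
        M[k + 1, k + 1] += w[k] / 3.0
        M[k, k + 1] += w[k] / 6.0
        M[k + 1, k] += w[k] / 6.0
    b = hat_functions(samples, nodes).mean(axis=0)   # sample means of the base functions
    f = np.zeros(m)
    for _ in range(sweeps):             # Gauss-Seidel sweeps
        for r in range(m):
            f[r] = (b[r] - M[r] @ f + M[r, r] * f[r]) / M[r, r]
    return f
```

The matrix is symmetric and diagonally dominant, so the iteration converges quickly, and for a uniform sample on the domain the estimated lattice values are all close to the true constant density.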



2.4 Generalisation to Unbounded Feature Spaces 

In order to apply the PML method to unbounded feature vectors $y$, their components
are mapped to $x_i$-values, satisfying $0 < x_i < 1$. We decided to perform this
mapping by applying first an appropriate shifting and scaling of the feature
vector components $y_i$ (yielding $y'_i$) and applying the inverse logit function:

    \mathrm{logit}^{-1} : \mathbb{R} \to \, ]0, 1[ \; : \; \mathrm{logit}^{-1} y' = \frac{1}{1 + e^{-y'}} .

This function has the same form as the cumulative logistic distribution function [?]:

    F(y; \mu, \sigma) = \frac{1}{1 + e^{-\pi (y - \mu) / (\sigma \sqrt{3})}} ,

where $\mu$ and $\sigma$ are resp. the expectation and the standard deviation of $y$. This
means that, by putting ($\forall i \in \{1, \ldots, n\}$):

    x_i = \mathrm{logit}^{-1} \left( \frac{\pi (y_i - \mu_i)}{\sigma_i \sqrt{3}} \right) ,    (10)

where $\mu_i$ and $\sigma_i$ are the expectation and the standard deviation of the feature
$y_i$, we achieve a maximal fit of the data to the logistic distribution and, in that
sense, an optimal spreading of the values of $x_i$-data over the interval ]0, 1[.
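
The mapping amounts to one line of arithmetic per feature. In the sketch below the function name is illustrative, and estimating the expectation and standard deviation from the data when they are not supplied is an assumption of this example.

```python
import numpy as np

def to_unit_interval(y, mu=None, sigma=None):
    """Map unbounded feature values into ]0, 1[ with the scaled inverse logit."""
    y = np.asarray(y, dtype=float)
    mu = y.mean(axis=0) if mu is None else mu
    sigma = y.std(axis=0) if sigma is None else sigma
    return 1.0 / (1.0 + np.exp(-np.pi * (y - mu) / (sigma * np.sqrt(3.0))))
```

A value equal to the mean maps to 0.5, and a value one standard deviation above the mean maps to roughly 0.86.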

3 Experiments 

On the basis of the PML model, described in Sect. 2, we developed algorithms
for the estimation of PDFs and for the design of Bayesian classifiers. These
algorithms were implemented in the C programming language on a SUN SPARCstation
running under the Solaris UNIX operating system. The algorithm, based
on the set of linear equations given by (9), which calculates the lattice node
PDF estimates $\hat{f}_{t,r}$, uses simple Gauss-Seidel iteration. The time needed in the
following examples to converge to a solution is in the order of magnitude of 
seconds. 

3.1 Comparison with PDF Estimation on the Basis of Orthonormal 
Expansion 

In order to study the behaviour of the algorithms, from a point of view of esti- 
mation quality, we performed a few simple numerical simulation experiments, 
involving one single feature x. In all cases, we considered problems with the 
same number of unknown parameters: in the case of PML modelling, the domain
of $x$ was split into 10 equal intervals (= “cells” in our terminology), in the
case of orthonormal expansion we used a model based on Legendre polynomials
with a maximal degree of 10. (Note that in both cases the model is characterised
by 11 parameters: for the former one, 11 lattice point PDFs, for the latter one,
the weight coefficients of eleven polynomials including the zero degree polynomial).
In Fig. 2 we plotted the results of experiments with different kinds of
probability densities (PD) which are zero outside $I$: a PD which is uniform for
all $x \in I$, a PD which is uniform in part of $I$, and zero elsewhere (and therefore
discontinuous in $I$), and a PD which satisfies a normal distribution, restricted
to $x \in I$. Details are described in the figure's caption.

Fig. 2. Numerical univariate PDF estimation experiments; dotted lines show the exact
distribution, solid lines, the estimation results: on the left side, the results based on a
piecewise linear approximation, on the right side, the results based on PDF decomposition
in Legendre polynomials; in the upper row, the sample size is 300, in the middle
row and lower row, the sample size is 30000; the $\sigma$ of the “clipped” normal distribution
used in the lower row is 0.1.



3.2 Simulated Pattern Classification Experiment 

In order to verify the usefulness of the PML in circumstances where linear or
quadratic discriminant analysis would fail, we performed a simulated two-class
pattern classification experiment where the symmetry of the class PDFs of $x$
causes $\mu_1$ and $\mu_2$ (the class expectations) and $\Sigma_1$ and $\Sigma_2$ (the class covariance
matrices) to be identical. Figure 3 describes the PDFs used and the classification
results in an experiment where the PML classifier involves $10^2$ cells.

To compare the classification results with the ones obtained on the basis
of orthonormal expansion, using 14th degree Legendre polynomials of the two
variables, we designed a classifier, based on this model and on the same training
set and applied it to the same test set; the PML function based model yields a
classification efficiency of 97.7%, the orthogonal expansion model, 98.4%.

                                     $\omega \in \Omega_1$   $\omega \in \Omega_2$
  $\omega$ assigned to $\Omega_1$          498             10
  $\omega$ assigned to $\Omega_2$           13            479

Fig. 3. Left: PDFs used in the simulated two-class pattern classification experiment:
the (class independent) PDF of $x$ is uniform inside the square; points inside the sectors
labelled 0 belong to class $\Omega_1$, points inside the sectors labelled 1 belong to class $\Omega_2$.
Right: result of classification applied to 1000 patterns of an independent test set; the
model involves a lattice of 121 points; the size of the training set is 100000.

3.3 Experiments on Real Measurement Data 

Finally, we performed a classification experiment, involving the Fisher iris data,
used in the examples of stepwise discriminant analysis in [7]. These data are
well suited to demonstrate the design and application of linear and quadratic
discriminant functions. We applied both discriminant analysis and PML classifier
design ($10^2$ cells) to the two most discriminating features in the data. The
training set contained 123 patterns. The classifiers were applied to 27 patterns of
an independent test set; the classifier from quadratic discriminant analysis yields
3 misclassifications; the PML classifier yields 4 misclassifications. Note that the
design and application of the PML classifier involves the mapping of the data
given by (10), using estimates for $\mu_i$ and $\sigma_i$.

4 Discussion and Conclusions 

In this section, we discuss the PML approach and compare it with quadratic 
discriminant analysis (QDA) and orthogonal expansion (OE) — i.e. expansion 
in orthonormal base functions. 

Versatility. The versatility of the PML model is mainly illustrated by the
example of Sect. 3.2. In a situation where designing a classifier with QDA would
make no sense at all, the PML model yields a high classification efficiency.

Robustness. The examples of Sect. 3.1 show that for PDFs which in theory
can perfectly be matched by both the PML model and the OE model, the
performances in approximating the PDF are comparable (see the upper row of the
figure of Sect. 3.1: uniform distribution). However, in the presence of a
discontinuity or when the approximated PDF has another kind of non-polynomial
behaviour, the OE approach may yield PDF estimates showing spurious local
variations throughout the whole domain, while a model mismatch has only a local
effect in the PML approach (see the middle and lower rows of the same figure).
The example of Sect. 3.2 illustrates that this does not necessarily imply a
better classification efficiency for PML classifiers, but even in that example,
the efficiencies of the OE approach and the PML approach are similar. The
example of Sect. 3.3 demonstrates that, compared to the classification
efficiency of a classifier from QDA, the performance of the PML classifier,
observed on an independent test set, is not much affected by the high number of
parameters characterising this classifier (a model with 10 x 10 cells has
11 x 11 = 121 lattice points and an equal number of parameters).

Complexity, time consumption and memory requirements. The complexity of the
algorithms in terms of elementary operations depends on the number
of features n, the number of classes k, the size $p_t$ of the training set for each
class and, specifically for the PML approach, the choice for each feature $x_i$
of $m_i$ determined by the point lattice. If we limit ourselves to the operations
which are crucial for the speed of the algorithms, it is clear that, during
the training phase, the PML approach is by far the most time-consuming one:
the calculation of the unknown lattice point potentials, for one class, is
an iterative process, where at each step the number of multiplications is in the
order of magnitude of […]. In both QDA and OE, the training time is
mainly determined by $p_t$ ($t \in \{1, \ldots, k\}$). The time spent by the PML method
to calculate the right-hand side of the corresponding equation is also considerable
but is still small in comparison to the time consumed by the abovementioned
iterative process, even for moderate n and relatively low $m_i$-values. We have,
however, made no effort so far to optimise this stage of the training process.

The time consumption during the application of the classifier to a pattern
is much more favourable: the number of multiplications
is in the order of magnitude of $3 \times 2^n$ (which includes the calculation of the
basis function values). For QDA, the number of multiplications is in the order
of magnitude of $n \times (n + 1)$, which is relatively small for high n-values. When
one compares the PML approach with OE, however, one should be aware that
the time consumption is independent of the total number of base functions in
the PML approach (since only one cell is involved), while in OE
all base functions have to be evaluated and all values have to be weighted with
the corresponding coefficients $a_{tj}$. When a regular lattice is used, the time to
identify the involved cell and to calculate the gi-variables can be neglected.
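
To make the two quoted orders of magnitude concrete, the following small sketch tabulates them for a few feature counts. It only evaluates the expressions $3 \times 2^n$ and $n \times (n + 1)$ from the text; the function names are illustrative, and no constant factors or per-class multipliers are implied.

```python
def pml_multiplications(n):
    """Order of magnitude quoted for applying the PML classifier: 3 * 2^n."""
    return 3 * 2 ** n


def qda_multiplications(n):
    """Order of magnitude quoted for QDA: n * (n + 1)."""
    return n * (n + 1)


for n in (1, 2, 4, 8, 16):
    print(f"n={n:2d}  PML ~ {pml_multiplications(n):7d}  QDA ~ {qda_multiplications(n):4d}")
```

For two features the counts are of the same order (12 against 6), whereas the exponential term dominates quickly as n grows.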

The memory required for the storage of the model parameters is obviously
the lowest in the case of QDA, since the major part is taken by the storage of
the covariance matrices, i.e. storage for $k n^2$ floating point numbers. For comparable
performances, the number of parameters in the OE model and the PML model
are roughly equal. Like time consumption, the memory requirements are an
exponential function of the number of features: in the case of a PML classifier
model involving $\prod_{i=1}^{n} m_i$ cells for each class, a storage of
$\prod_{i=1}^{n} (m_i + 1)$ floating point numbers per class is required. During the
training of the PML model parameters this must be doubled for the storage of the
right-hand side of the corresponding equation.
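
The storage figures can be illustrated with a short sketch. It assumes one stored value per lattice point and per class for the PML model, and $k n^2$ values for QDA as stated above; the helper names are hypothetical.

```python
from math import prod


def pml_storage_per_class(cells_per_feature):
    """Number of lattice points (stored parameters) for one class,
    given the number of cells m_i along each feature."""
    return prod(m + 1 for m in cells_per_feature)


def qda_storage(n_features, n_classes):
    """Dominant QDA storage: one n x n covariance matrix per class."""
    return n_classes * n_features ** 2


print(pml_storage_per_class([10, 10]))   # the 10 x 10 cell model of Sect. 3
print(qda_storage(n_features=2, n_classes=2))
```

For the 10 x 10 cell model this gives the 121 lattice points mentioned in the robustness discussion.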

Conclusions. This paper introduces the theory and methods to estimate pro- 
bability density functions on the basis of a piecewise multi-linear model. An 
approach has been proposed to generalise the application of the methodology to 
unbounded feature spaces. It has been demonstrated that the model, introduced 
in the present paper, is able to represent a large class of probability distributions 
in a satisfactory, robust way. It has also been shown that the efficiency of the
derived classifiers is often better than — and at least equivalent to — the effi- 
ciency of classifiers obtained using quadratic discriminant analysis or orthogonal 
expansion. An attractive particularity of our approach is that almost all com- 
putation overhead is shifted to the training procedure (which only needs to be 
applied once for a given classifier). The classifier is fast in comparison to other
classifiers of similar complexity.

Abstract. A genetic algorithm is employed in order to select the appropriate
number of components for mixture model classifiers. In this
classifier, each class-conditional probability density function can be
approximated well using the mixture model of Gaussian distributions. Therefore,
the classification performance of this classifier depends on the
number of components by nature. In this method, the appropriate number
of components is selected on the basis of class separability, while a
conventional method is based on likelihood. The combination of mixture
models is evaluated by a classification oriented MDL (minimum description
length) criterion, and its optimization is carried out using a genetic
algorithm. The effectiveness of this method is shown through the experimental
results on some artificial and real datasets.

Keywords: mixture model classifier, class-conditional probability den- 
sity function, class separability, minimum description length criterion, 
genetic algorithm 



1 Introduction 

In pattern recognition, we often apply the mixture model of Gaussian distribu- 
tions to the approximation of a class-conditional probability density function. 
The mixture model is more expressive than simple distributions, so the Bayesian 
classifier based on the mixture models has higher classification performance than 
simple classifiers. In addition, this semiparametric classifier can absorb the sta- 
tistical fluctuation of training samples, unlike nonparametric classifiers. 

In the mixture model classifiers, the number of components directly affects
the classification performance by nature. Nevertheless, in many applications, the
number of components seems to be selected without careful consideration. One
practical and reasonable selection method is based on the MDL (minimum
description length) criterion. In this method, the appropriate number of components
is selected for each class on the basis of the likelihood for the training samples,
penalizing an excessively large number of components in order to avoid overfitting.

However, in pattern recognition, our main concern is classification. Such a
class-independent selection may not be appropriate, especially when the
number of training samples is small. For this problem, we have proposed a
class-combinatorial selection method [3]. In this method, the combination of
mixture models is evaluated on the basis of their class separability for the training
samples, penalizing overly complex classifiers in order to avoid overlearning.

In our previous studies [2,3], the classification oriented criterion seemed to
work well. The proposed method achieved the best recognition rate in some
experiments. However, because of the limited search ability of the employed greedy
algorithm, we could not show the relation between the optimal number of components
under the criterion and the resultant recognition rate.

Therefore, in this study, we employ a genetic algorithm in order to search
for a quasi-optimal solution. An experimental comparison of three methods,
the likelihood method, the class separability method with a greedy algorithm
and the class separability method with a genetic algorithm, is carried out.

2 Mixture Model Classifiers 

The mixture model classifier is one practical and effective implementation of 
Bayesian classifier. In this classifier, each class-conditional probability density 
function is approximated by the mixture of some component distributions. For 
this purpose, Gaussian distributions are generally used. Each probability density 
function is described as follows: 

\[ p(x) = \sum_{k=1}^{K} c_k \, \mathcal{N}(m_k, \Sigma_k)(x), \]

where $x$ is a feature vector, $K$ is the number of components, $\mathcal{N}(m_k, \Sigma_k)(\cdot)$ is a
Gaussian distribution with a mean vector $m_k$ and a covariance matrix $\Sigma_k$, and
$c_k$ is the weight of the $k$th component ($\sum_{k=1}^{K} c_k = 1$).
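
As an illustration of this density, the sketch below evaluates a mixture in the univariate case. Restricting to one dimension (scalar means and variances instead of mean vectors and covariance matrices) is a simplification made here so that the example needs no linear algebra; the parameter values are arbitrary.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_k c_k N(m_k, var_k)(x); the weights c_k must sum to one."""
    return sum(c * gaussian_pdf(x, m, v)
               for c, m, v in zip(weights, means, variances))


# Arbitrary two-component example (illustrative values only).
weights, means, variances = [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5]
print(mixture_pdf(0.0, weights, means, variances))
```

Because the weights sum to one, the mixture integrates to one whenever each component does.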

In this study, we estimate the mixture models using an iterative procedure
called the EM algorithm. It enables local maximization of the following
log-likelihood indirectly:

\[ L = \sum_{i=1}^{N} \ln p(x_i) = \sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} c_k \, \mathcal{N}(m_k, \Sigma_k)(x_i) \right\}, \tag{1} \]

where $N$ is the number of training samples.
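
A minimal sketch of one EM iteration for a univariate Gaussian mixture is given below, together with the log-likelihood of Eq. (1). It is not the authors' implementation: the data are synthetic, the model is one-dimensional, the starting values are picked by hand rather than by fuzzy c-means, and the variance floor `min_var` is an added safeguard.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of Eq. (1) for a univariate mixture."""
    return sum(
        math.log(sum(c * gaussian_pdf(x, m, v)
                     for c, m, v in zip(weights, means, variances)))
        for x in data)


def em_step(data, weights, means, variances, min_var=1e-6):
    """One EM iteration: responsibilities (E-step), then re-estimation (M-step)."""
    resp = []
    for x in data:
        joint = [c * gaussian_pdf(x, m, v)
                 for c, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / len(data))
        new_m.append(mean)
        new_v.append(max(var, min_var))   # safeguard, not part of plain EM
    return new_w, new_m, new_v


# Synthetic data from two well-separated Gaussians (illustrative only).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(150)]
        + [rng.gauss(3.0, 0.5) for _ in range(150)])
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(30):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
print(history[0], history[-1])
```

Each iteration does not decrease L, which is the sense in which EM maximizes the log-likelihood locally.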

The EM algorithm requires both the number of components and the initial
components. For the initial components, we use the fuzzy c-means method with the
fuzziness factor 1.6. Thus, once the number of components is given, we can obtain
a mixture model by the EM algorithm through the fuzzy clustering.

We estimate a mixture model for each class and combine them into a classifier
through Bayesian theory. Therefore, in order to make the best use of the mixture
model classifier, we have to select the appropriate number of components for
each class in advance.

3 Selection of the Appropriate Number of Components 

In this study, the appropriate number of components is selected basically by
repeating the estimation and evaluation of a mixture model, varying the number
of components. The evaluation is made by the MDL criterion. There are two
methods, which differ in the calculation of the criterion.

3.1 Selection Method Based on Likelihood 

In this method, we select the appropriate number of components by minimizing
the following conventional formula:

\[ \mathrm{MDL}_{LH} = -L + \frac{m}{2} \log N, \tag{2} \]

where $L$ is the log-likelihood of Eq. (1), and $m = (K - 1) + K \{ D + D(D + 1)/2 \}$
is the number of parameters for specifying one mixture model. Here, $D$ is the
dimensionality of the feature space.

In Eq. (2), the first term is low when the mixture model fits the training
samples well, and the second term is low when the mixture model is simple, i.e., the
number of components is small. In other words, this method maximizes the
likelihood, penalizing an excessively large number of components.

We can obtain the mixture model that minimizes Eq. (2) simply by varying
the number of components within a certain range. Note that this method is
carried out for each class individually.
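
The selection by likelihood can be sketched as follows. The parameter count m is taken from the text; the criterion itself is assumed here to have the conventional MDL form, minus the log-likelihood plus (m/2) log N with a natural logarithm, and the log-likelihood values in the example are invented purely for illustration.

```python
import math


def num_parameters(K, D):
    """m = (K - 1) + K * (D + D(D + 1)/2): weights, means and covariances."""
    return (K - 1) + K * (D + D * (D + 1) // 2)


def mdl_lh(log_lik, K, D, N):
    """Assumed conventional MDL form: -L + (m / 2) * log N."""
    return -log_lik + 0.5 * num_parameters(K, D) * math.log(N)


def select_num_components(log_liks, D, N):
    """Pick the K with the smallest criterion; log_liks maps K to L."""
    return min(log_liks, key=lambda K: mdl_lh(log_liks[K], K, D, N))


# Hypothetical log-likelihoods for K = 1, 2, 3 on one class (not from the paper).
log_liks = {1: -500.0, 2: -450.0, 3: -448.0}
print(select_num_components(log_liks, D=2, N=200))
```

In this invented example the third component improves the likelihood only slightly, so the penalty term makes K = 2 the minimizer.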

3.2 Selection Method Based on Class Separability 

In order to take the class separability into account, we use another calculation of
the MDL criterion proposed by Kudo et al. In this method, both the classification
ability for the training samples $S$ and the complexity of the classifier $c$ are
evaluated in bit length. We select the classifier which minimizes the following
formula:

\[ \mathrm{MDL}_{CS} = L(S|c) + L(c), \tag{3} \]

where $L(S|c)$ is the description length for encoding the labels of the training
samples among $S$ misclassified by $c$, and $L(c)$ is the description length of $c$ itself.

The detailed calculation of the above two terms for the mixture model classifiers
is as follows:

\[ L(S|c) = N^- H(N_1^-, \ldots, N_M^- \mid N^-) + \tfrac{1}{2}(M - 1)\log_2 N^- + N H(N^-, N^+ \mid N) + \tfrac{1}{2}\log_2 N, \]
\[ L(c) = \alpha K D^{\beta} \log_2 N, \]

where $M$ is the number of classes, $N^+$ and $N^-$ are the numbers of samples
classified correctly and incorrectly, respectively, and $H(N_1, \ldots, N_M \mid N)$ is the entropy
calculated by $-\sum_{i=1}^{M} \frac{N_i}{N} \log_2 \frac{N_i}{N}$. Here, $K$ is the total number of components for
all classes, and $\alpha$ and $\beta$ are constants for suppressing the superabundant effect
of the second term. We experimentally determined $\alpha$ to be 0.235 and $\beta$ to be
0.876 for the mixture model classifiers. See the cited reference for the details.

In Eq. (3), the first term is low when the classifier can classify the training
samples well, and the second term is low when the classifier is simple. In other
words, this method maximizes the class separability, penalizing overly complex
classifiers.
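
The two description lengths can be sketched in code as below. The sketch assumes the reading $L(c) = \alpha K D^{\beta} \log_2 N$ for the classifier term and skips the terms involving $N^-$ when no sample is misclassified (a convention chosen here to avoid $\log_2 0$); all function names are illustrative.

```python
import math

ALPHA, BETA = 0.235, 0.876   # constants quoted in the text


def entropy(counts):
    """H(N_1, ..., N_M | N) = -sum (N_i / N) log2 (N_i / N), with 0 log 0 = 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def description_length_errors(misclassified_per_class, n_total):
    """L(S|c): bits for locating the misclassified samples and their labels."""
    n_minus = sum(misclassified_per_class)
    n_plus = n_total - n_minus
    n_classes = len(misclassified_per_class)
    length = n_total * entropy([n_minus, n_plus]) + 0.5 * math.log2(n_total)
    if n_minus > 0:   # assumed convention: drop these terms when N^- = 0
        length += (n_minus * entropy(misclassified_per_class)
                   + 0.5 * (n_classes - 1) * math.log2(n_minus))
    return length


def description_length_classifier(k_total, dim, n_total):
    """L(c), under the assumed form alpha * K * D^beta * log2 N."""
    return ALPHA * k_total * dim ** BETA * math.log2(n_total)


def mdl_cs(misclassified_per_class, n_total, k_total, dim):
    return (description_length_errors(misclassified_per_class, n_total)
            + description_length_classifier(k_total, dim, n_total))


print(mdl_cs([5, 5], n_total=200, k_total=6, dim=2))
```

More training errors raise the first term, and more components raise the second, which is the trade-off the criterion is meant to balance.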

In this method, the search for the classifier which minimizes Eq. (3) is a
class-combinatorial problem. We implemented the following two different algorithms
for the mixture model classifiers.

— Greedy Algorithm

Start with one component for each class, and increase the number of components
for a certain class so as to decrease Eq. (3) (proposed in reference [2]).

— Genetic Algorithm

Let $K_1 K_2 \cdots K_M$ be a “chromosome”, where the “gene” $K_i$ is the number of
components for the $i$th class. Let the set of $C$ chromosomes be the “population.”
Follow these steps.

1. Initialize the population. 

Each chromosome is initialized with random numbers. 

2. Select the better chromosomes. 

a) For each chromosome, calculate MDLcs according to Eq. (3).

b) For each chromosome, let (MDLworst − MDLcs) be the “fitness” $f$.
Here, MDLworst is the worst MDLcs among the past $W$ generations.

c) Select $C$ chromosomes as a new population, according to the proportions
of the fitness, i.e., the $j$th chromosome is selected with the
probability of $f_j / \sum_k f_k$. Note that the better chromosomes may be
selected two or more times.

3. Generate the crossovers of the chromosomes. 

a) For each chromosome, select a partner chromosome randomly,
and generate uniform crossovers between them with the probability
of $P_c$. This means the corresponding genes between the two chromosomes
are exchanged randomly. By this process, $2C$ chromosomes
are generated.

b) Select C chromosomes randomly from the 2C chromosomes. 

4. Mutate the chromosomes. 

For each chromosome and each gene in the chromosome, modify the gene
randomly with the probability of $P_m$.

5. Repeat G times from step 2 to step 4.

In this study, we used 50 chromosomes for C, 5 generations for W, 60% for
$P_c$, 1/M for $P_m$ and 100 generations for G. In addition, we employed the “elitist
strategy,” in which the best chromosome is always carried over to the next
generation. This technique guarantees that MDLcs is non-increasing. We adopt
the final elite as the quasi-optimal solution of this problem.
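
The steps above can be sketched as follows. This is an illustrative reading, not the authors' code: the cost function `toy_cost` is a made-up stand-in for MDLcs (which would require training a mixture model classifier per chromosome), the upper bound `k_max` on a gene is an assumption, and the handling of the case where all fitness values are zero is a choice made here.

```python
import random


def genetic_search(cost, n_classes, k_max=8, pop_size=50, window=5,
                   p_cross=0.6, generations=100, seed=0):
    """Minimise cost(chromosome); gene i is the component count of class i."""
    rng = random.Random(seed)
    p_mut = 1.0 / n_classes
    # Step 1: random initial population.
    population = [[rng.randint(1, k_max) for _ in range(n_classes)]
                  for _ in range(pop_size)]
    elite = list(min(population, key=cost))
    elite_history = [cost(elite)]
    recent_worst = []
    for _ in range(generations):
        # Step 2: fitness = worst cost over the last `window` generations - cost.
        costs = [cost(ch) for ch in population]
        recent_worst = (recent_worst + [max(costs)])[-window:]
        fitness = [max(recent_worst) - c for c in costs]
        if sum(fitness) > 0:   # otherwise keep the population as it is
            chosen = rng.choices(population, weights=fitness, k=pop_size)
            population = [list(ch) for ch in chosen]
        # Step 3: uniform crossover with a random partner gives 2C offspring,
        # of which C are kept at random.
        offspring = []
        for ch in population:
            a, b = list(ch), list(rng.choice(population))
            if rng.random() < p_cross:
                for g in range(n_classes):
                    if rng.random() < 0.5:
                        a[g], b[g] = b[g], a[g]
            offspring += [a, b]
        population = rng.sample(offspring, pop_size)
        # Step 4: mutate each gene with probability p_mut.
        for ch in population:
            for g in range(n_classes):
                if rng.random() < p_mut:
                    ch[g] = rng.randint(1, k_max)
        # Elitist strategy: the best chromosome found so far always survives.
        best = min(population, key=cost)
        if cost(best) < cost(elite):
            elite = list(best)
        else:
            population[0] = list(elite)
        elite_history.append(cost(elite))
    return elite, elite_history


def toy_cost(chromosome, target=(1, 5, 3)):
    """Hypothetical stand-in for MDLcs: squared distance to a fixed target."""
    return sum((k - t) ** 2 for k, t in zip(chromosome, target))


elite, history = genetic_search(toy_cost, n_classes=3)
print(elite, history[-1])
```

Because the elite is re-inserted whenever the new population does not improve on it, the recorded elite cost never increases from one generation to the next.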

4 Experiments 

4.1 Artificial Datasets 

First, we carried out experiments using two artificial datasets called “Cross” and
“Ring.” Each dataset is a 2-class problem in a 2-dimensional feature space. In
the Cross dataset, one class forms a cross and the other class comprises four squares
surrounding the cross. These two classes are perfectly separable. In the Ring dataset,
one class forms a uniform ring with an inner radius of 2.147 and an outer
radius of 3.030, and the other class is a Gaussian distribution with the mean
vector at the origin and the unit covariance matrix. The radii were adjusted so
that the Bayesian recognition rate is 95.5%. Example results obtained by
the class separability method with the genetic algorithm are shown in Fig. 1. In
these examples, 100 training samples were used per class.
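
The Ring dataset and its quoted Bayesian recognition rate can be reproduced with a short sketch. It assumes equal class priors and a ring that is uniform in area; under these assumptions the only Bayes errors are Gaussian samples falling inside the ring, and for a standard 2-dimensional Gaussian the probability of a radius below r is 1 − exp(−r²/2).

```python
import math
import random

R_IN, R_OUT = 2.147, 3.030   # radii quoted in the text


def sample_ring(rng):
    """Point uniform over the area of the annulus (r^2 is uniform)."""
    r = math.sqrt(rng.uniform(R_IN ** 2, R_OUT ** 2))
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return r * math.cos(theta), r * math.sin(theta)


def sample_gaussian(rng):
    """Point from the Gaussian class: zero mean, unit covariance."""
    return rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)


def bayes_recognition_rate():
    """Bayes rate for equal priors (an assumption of this sketch)."""
    ring_density = 1.0 / (math.pi * (R_OUT ** 2 - R_IN ** 2))
    gauss_density_at_inner = math.exp(-0.5 * R_IN ** 2) / (2.0 * math.pi)
    if ring_density <= gauss_density_at_inner:
        raise ValueError("ring class does not dominate inside the ring")
    # Inside the ring the ring class wins everywhere, so the errors are
    # exactly the Gaussian samples with R_IN < r < R_OUT.
    p_gauss_in_ring = math.exp(-0.5 * R_IN ** 2) - math.exp(-0.5 * R_OUT ** 2)
    return 1.0 - 0.5 * p_gauss_in_ring


rate = bayes_recognition_rate()
print(f"{rate:.3f}")
```

Under the stated assumptions the computed rate agrees with the 95.5% given in the text.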





[Figure panel (b): Ring dataset (K1 = 1, K2 = 5); horizontal axis: Feature 1]



Fig. 1. Example results for (a) Cross dataset and (b) Ring dataset. The small symbols, 
the ellipses and the solid lines show the training samples, the component distributions 
and the classification boundaries, respectively. 



For these two different types of datasets, we compared the three selection
methods, using 20, 30, 40, 50, 100, 200 and 500 training samples per class. For
each number of training samples, we carried out ten trials with different random
seeds and took the averages as the results. The experimental results are shown in
Fig. 2 and Fig. 3.

Almost the same tendency was obtained in both datasets. When the number of
training samples is small, the recognition rates of CS-Greedy and CS-Genetic
are higher than that of LH. However, when the number of training samples is
large, LH is better than CS-Greedy, and CS-Genetic almost agrees with
LH. As for the minimization of MDLcs, CS-Genetic achieved a lower MDLcs than
CS-Greedy. This shows that CS-Genetic worked well.

[Figure panels: (a) Recognition Rate, (b) MDLcs, (c) Likelihood, (d) Number of Components]

Fig. 2. Experimental result for Cross dataset. CS-Genetic, CS-Greedy and LH
correspond to the class separability method with the genetic algorithm, the class
separability method with the greedy algorithm and the likelihood method,
respectively. The likelihood in (c) shows the total sum of the log-likelihood for all
classes, and the number of components in (d) shows the total sum of the number of
components for all classes.

Note that a high recognition rate was not always obtained from high like- 
lihood models, especially in the case of the small number of training samples. 
However, in the case of the large number of training samples, high likelihood 
models marked a better recognition rate than the other models. This confirms 
LH can estimate the underlying distributions precisely in the case of the large 
number of training samples. 

As for the number of components, LH selected many components, especially
when the number of training samples is small. On the other hand,
both CS methods selected only a few components in that case. As the number of
training samples increases, the number of components of LH converged within a
certain range, that of CS-Greedy did not converge, and that of CS-Genetic
converged. By the consistency property of the MDL criterion, the selected model
has to converge as the number of training samples increases.
The results also show that CS-Genetic worked well from this viewpoint.

[Figure panels: (c) Likelihood, (d) Number of Components]

Fig. 3. Experimental result for Ring dataset. CS-Genetic, CS-Greedy and LH
correspond to the class separability method with the genetic algorithm, the class
separability method with the greedy algorithm and the likelihood method,
respectively. The likelihood in (c) shows the total sum of the log-likelihood for all
classes, and the number of components in (d) shows the total sum of the number of
components for all classes.



4.2 Real Datasets 

We also compared the three methods using two real datasets called “Ship” and
“Wine.” Ship aims to distinguish 8 types of military ships with 11 features. See
reference [3] for the details. Wine is taken from the UCI machine learning
database. Three wines are recognized by 13 chemical features. The total
numbers of samples are 2545 and 178, respectively, and we estimated the recognition
rates by 2-fold and 10-fold cross-validation, respectively. The Bayesian quadratic
classifier and the k-nearest neighbor method (k = 5) were also tested. The
experimental results are shown in Table 1 and Table 2. In addition, the transitions of
MDLcs and the recognition rate are shown in Fig. 4 and Fig. 5.

Table 1. Experimental results for Ship dataset. Here, R, # and L* are the recognition
rate, the total sum of the number of components for all classes and the total sum of
log-likelihood for all classes, respectively.

Classifier             R[%]    #      L*          MDLcs
Mixture(CS-Genetic)    94.32   17.5   8771.52     476.02
Mixture(CS-Greedy)     94.32   17.5   8771.52     476.02
Mixture(LH)            94.07   34.0   14794.61    —
Quadratic              93.15   —      —           —
kNN(k = 5)             86.71   —      —           —


Table 2. Experimental results for Wine dataset. Here, R, # and L* are the recognition
rate, the total sum of the number of components for all classes and the total sum of
log-likelihood for all classes, respectively.

Classifier             R[%]    #      L*          MDLcs
Mixture(CS-Genetic)    97.87   3.0    -1646.71    60.01
Mixture(CS-Greedy)     97.87   3.0    -1646.71    60.01
Mixture(LH)            30.11   23.2   4321.24     —
Quadratic              97.87   —      —           —
kNN(k = 5)             95.77   —      —           —




[Figure panels: (a) MDLcs, (b) Recognition Rate]



Fig. 4. Transition of MDLcs and the recognition rate for Ship dataset (the first trial 
in the 2-fold cross-validation). 



In these datasets, we could not obtain a significant difference between CS-Genetic
and CS-Greedy. However, both CS methods achieved higher recognition
rates than LH. As for the number of components, LH selected a larger
number of components than the others. As a result, the likelihood of LH was also
higher. In light of the results for the artificial datasets, this suggests that the
number of training samples for these real datasets might be small relative to their
difficulty. In addition, the recognition rates of both CS methods were higher
than those of the two conventional classifiers. This shows the superiority of the
mixture model classifiers with the appropriate number of components.

[Figure panels: (a) MDLcs, (b) Recognition Rate; horizontal axis: Generation]

Fig. 5. Transition of MDLcs and the recognition rate for Ship dataset (the second trial
in the 2-fold cross-validation).

From Fig. 4(a) and Fig. 5(a), we can confirm that CS-Genetic worked well for
minimizing MDLcs. However, in Fig. 4(b) and Fig. 5(b), the peaks of the
recognition rate curves for the test samples appeared in the earlier generations.
If the true model, i.e., the optimal mixture model classifier, is more complex
than the selected model, this result agrees with the consistency property of the MDL
criterion.



5 Conclusion 

The genetic algorithm was employed in order to select the appropriate number 
of components for the mixture model classifiers. In this study, the appropriate 
number of components was selected on the basis of class separability, while the 
conventional method was based on likelihood. The combination of mixture mo- 
dels was evaluated by the classification oriented MDL criterion, and its opti- 
mization was carried out using the genetic algorithm. The effectiveness of this 
method was shown through the experimental results on some artificial and real 
datasets. 

We have to carry out more experiments on other datasets in order to
justify the classification oriented MDL criterion. Artificial datasets with varying
dimensionality of the feature space will be utilized for this purpose. The
effect of the parameters in the genetic algorithm should also be investigated.

Abstract. The mixel is a heterogeneous pixel that contains multiple
constituents within a single pixel, and the statistical properties of a po- 
pulation of mixels can be characterized by the mixel distribution. Prac- 
tically this model has a drawback that it cannot be represented in clo- 
sed form, and prohibitive numerical computation is required for mixture 
density estimation problem. Our discovery however shows that the “mo- 
ments” of the mixel distribution can be derived in closed form, and this 
solution brings about significant reduction of computation cost for mix- 
ture density estimation after slightly modifying a typical algorithm. We 
then show the experimental result on satellite imagery, and find out that 
the modified algorithm runs more than 20 times faster than our previous 
method, but suffers little deterioration in classification performance. 



1 Introduction 

The mixel, or the mixed pixel, is the “amalgam” of multiple constituents con- 
tained within a single pixel. Because of the finite resolution of sensors we use 
to observe the real world, a heterogeneous region, as well as a homogeneous re- 
gion, is scanned as a single pixel. Hence we should regard the digital imagery 
as the spatially quantized (sampled) representation of the real world, and take 
for granted the presence of mixels. When the resolution of sensors is relatively 
coarse to the scale of objects in the real world, for example in remote sensing or 
medical images, the presence of mixels is usually inevitable. 

To characterize the statistical properties of the mixels based on the proba- 
bility theory, we can think of the probability distribution function (PDF) of 
mixels, or the mixel distribution. It is basically a phantom distribution not cor- 
responding to any objects in the real world, because the mixture of multiple 
constituents takes place through the process of observation. The author insists 
that this implicit distribution should be taken into account explicitly for the 
statistical analysis of images. Our findings on the closed form moments of the 
standard mixel distribution (SMD) then open the way to the efficient
computation of mixture density estimation including the SMD.

The organization of the paper is as follows. Firstly, Sect. 2 introduces related
works on mixels and also states several definitions. Secondly, Sect. 3 describes
the main results of the paper, namely the closed form moments of the SMD. Then
Sect. 4 explains the mixture density estimation modified for the efficient
computation of the SMD. Next, Sect. 5 illustrates experimental results on satellite
imagery, and finally Sect. 6 concludes the paper.



2 Mixel Models 



Related Works. Past literature on mixels mainly focused on the estimation
of area proportions for each mixel from observed pixel values. Methods can be
grouped into several approaches; for instance, image geometry-based methods,
probability model-based methods, fuzzy model-based methods,
linear projection-based methods, and non-linear regression-based methods.
The approach used in this paper shares some motivation with the probability
model-based methods in that area proportions are estimated from parametric
PDF models of constituent classes. However, the most important distinction
of our method from those methods lies in the introduction of the mixel
distribution. This model has not been paid much attention, probably because this
distribution is hidden, and has an effect only when the percentage of mixels is
relatively high. However, the mixel distribution shows interesting properties different
from conventional PDF models, and hence requires special treatment for
its efficient computation. Next we summarize definitions required afterwards.

Definition 1 (Linear Model of the Mixel). A “K-class mixel” is a pixel that
contains K constituent classes. Let the area proportion of constituent
class $C_i$ be $a_i \in (0, 1)$. From the definition of the area proportion, the sum of area
proportions for all constituent classes amounts to unity; namely $\sum_{i=1}^{K} a_i = 1$.
Moreover let the radiation from class $C_i$ be a D-dimensional vector $\boldsymbol{x}_i$. Then we
hypothesize that the observed radiance of the mixel $\boldsymbol{x}$ is represented by the linear
combination of the radiance from the constituents. Neglecting the common noise
term for simplifying the argument, we have

\[ \boldsymbol{x} = \sum_{i=1}^{K} a_i \boldsymbol{x}_i . \tag{1} \]

Definition 2 (Definition of the Mixel Distribution). From (1), it is clear
that the mixel distribution consists of two types of PDF models; namely a random
vector $\boldsymbol{x}_i$ drawn from the PDF of constituent class $C_i$ represented by $p(\boldsymbol{x}_i | C_i; \psi_i)$
with parameters $\psi_i \in \Psi$, and another random vector $\boldsymbol{a} = (a_1, \ldots, a_{K-1}) \in A$
drawn from the PDF of area proportions represented by $f(\boldsymbol{a}; \boldsymbol{\phi})$, where $A$ is the
space of area proportions, and $a_K = 1 - \sum_{i=1}^{K-1} a_i$. Then the “D-band K-class
mixel distribution” can be derived from K class-conditional D-dimensional
multivariate PDFs and one (K − 1)-dimensional multivariate PDF of area proportions.

Definition 3 (Definition of the SMD). The term “standard mixel distribution”
(SMD) is coined for representing a special mixel distribution derived
only from normal distributions (the PDF of constituent class) and the Beta
distribution (the PDF of area proportions). Formally, the PDF of class $C_i$ is a
multivariate normal distribution:

\[ p(\boldsymbol{x} | C_i; \psi_i) = N(\boldsymbol{\mu}_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_i) \right\}, \tag{2} \]

where $\boldsymbol{x}$ is a D-dimensional column vector, $\boldsymbol{\mu}_i$ is the D-dimensional mean
vector, $\Sigma_i$ is the D-by-D covariance matrix, $(\boldsymbol{x} - \boldsymbol{\mu}_i)^T$ is the transpose of
$(\boldsymbol{x} - \boldsymbol{\mu}_i)$, $\Sigma_i^{-1}$ is the inverse of $\Sigma_i$, and $|\Sigma_i|$ is the determinant of $\Sigma_i$. On the other
hand, the PDF of area proportions is the (K − 1)-dimensional Beta distribution:

\[ f(\boldsymbol{a}; \boldsymbol{\phi}) = \frac{\Gamma(\phi_1 + \phi_2 + \cdots + \phi_K)}{\prod_{i=1}^{K} \Gamma(\phi_i)} \prod_{i=1}^{K} a_i^{\phi_i - 1}, \tag{3} \]

where $\boldsymbol{\phi}$ is a K-dimensional parameter vector with i-th component $\phi_i > 0$ for all
i, $\phi = \sum_{i=1}^{K} \phi_i$, and $\Gamma(\cdot)$ is the Gamma function.
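
The density of area proportions can be evaluated and sampled with the standard library alone. The sketch below assumes the usual normalisation by Gamma functions shown in (3) and draws proportions by normalising independent Gamma variates, which is a standard construction rather than something stated in the paper; the function names are illustrative.

```python
import math
import random


def area_proportion_pdf(a, phi):
    """Density of the area proportions a = (a_1, ..., a_K), sum(a) = 1,
    for parameters phi = (phi_1, ..., phi_K)."""
    log_norm = math.lgamma(sum(phi)) - sum(math.lgamma(p) for p in phi)
    log_kernel = sum((p - 1.0) * math.log(ai) for ai, p in zip(a, phi))
    return math.exp(log_norm + log_kernel)


def sample_area_proportions(phi, rng):
    """Draw area proportions by normalising independent Gamma(phi_i, 1) variates."""
    g = [rng.gammavariate(p, 1.0) for p in phi]
    total = sum(g)
    return [x / total for x in g]


rng = random.Random(1)
print(area_proportion_pdf([0.25, 0.75], [2.0, 1.0]))
print(sample_area_proportions([1.5, 0.5], rng))
```

For $\phi_1 = \phi_2 = 1$ the density is constant, which is the uniform case used in the examples of Sect. 3.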



3 Central Moments of the SMD 

Derivation of the SMD. First we choose a certain fixed vector for area
proportions. Then (1) represents the linear combination of random vectors drawn
from normal distributions. Because the normal distribution is a stable
distribution, the mixel distribution also becomes the normal distribution with the
following parameters:

\[ \boldsymbol{\mu}_a = \sum_{i=1}^{K} a_i \boldsymbol{\mu}_i, \qquad \Sigma_a = \sum_{i=1}^{K} a_i^2 \Sigma_i . \tag{4} \]

If we further assume mutual independence between random vectors $\boldsymbol{x}_i$ and the
random vector $\boldsymbol{a}$, the randomization of $N(\boldsymbol{\mu}_a, \Sigma_a)$ with the a priori distribution
of area proportions yields the SMD $M(\boldsymbol{x})$ as follows:

\[ M(\boldsymbol{x}) = \int \cdots \int_A f(\boldsymbol{a}; \boldsymbol{\phi}) \, N(\boldsymbol{\mu}_a, \Sigma_a) \, d\boldsymbol{a} . \tag{5} \]

Central Moments of the SMD. Now we start to derive the moments of
the SMD (5). Unfortunately the SMD itself cannot be represented in closed
form because the integral involved in (5) cannot be solved. However, the central
moments, namely the mean vector and the covariance matrix, can be calculated in
closed form, which is the main result of this paper. For the concise presentation
of the paper, we will describe the detailed proof in the Appendix.

Proposition 1. The mean vector $\boldsymbol{\mu}_M$ of the SMD is represented by:

\[ \boldsymbol{\mu}_M = \frac{1}{\phi} \sum_{i=1}^{K} \phi_i \boldsymbol{\mu}_i . \tag{6} \]

Proposition 2. The covariance matrix $\Sigma_M$ of the SMD is represented by:

\[ \Sigma_M = \frac{ \sum_{i=1}^{K} \phi_i(\phi_i + 1)\Sigma_i + \sum_{i=1}^{K} \phi_i \boldsymbol{\mu}_i \boldsymbol{\mu}_i^T - \frac{1}{\phi}\sum_{i=1}^{K}\sum_{j=1}^{K} \phi_i \phi_j \boldsymbol{\mu}_i \boldsymbol{\mu}_j^T }{ \phi(\phi + 1) } . \tag{7} \]
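
One way to see where the two propositions come from is to condition on the area proportions and use the first and second moments of the proportions. The outline below is a sketch along these lines and is not the Appendix proof; it uses the standard moments of the distribution in (3), with $\phi = \sum_i \phi_i$ and $\delta_{ij}$ the Kronecker delta.

```latex
% Conditional on a, the mixel is normal with mean mu_a and covariance Sigma_a as in (4).
\begin{align*}
\mathrm{E}[a_i] &= \frac{\phi_i}{\phi}, \qquad
\mathrm{E}[a_i^2] = \frac{\phi_i(\phi_i + 1)}{\phi(\phi + 1)}, \qquad
\mathrm{Cov}(a_i, a_j) = \frac{\delta_{ij}\,\phi\,\phi_i - \phi_i\phi_j}{\phi^2(\phi + 1)}, \\
\boldsymbol{\mu}_M &= \mathrm{E}[\boldsymbol{\mu}_a]
  = \sum_{i} \mathrm{E}[a_i]\,\boldsymbol{\mu}_i
  = \frac{1}{\phi}\sum_{i} \phi_i \boldsymbol{\mu}_i, \\
\Sigma_M &= \mathrm{E}[\Sigma_a] + \mathrm{Cov}(\boldsymbol{\mu}_a)
  = \sum_{i} \mathrm{E}[a_i^2]\,\Sigma_i
    + \sum_{i}\sum_{j} \mathrm{Cov}(a_i, a_j)\,\boldsymbol{\mu}_i\boldsymbol{\mu}_j^T \\
 &= \frac{\sum_{i} \phi_i(\phi_i + 1)\Sigma_i
      + \sum_{i} \phi_i \boldsymbol{\mu}_i\boldsymbol{\mu}_i^T
      - \frac{1}{\phi}\sum_{i}\sum_{j} \phi_i\phi_j \boldsymbol{\mu}_i\boldsymbol{\mu}_j^T}
     {\phi(\phi + 1)} .
\end{align*}
```

The second line is the law of total expectation and the third is the law of total covariance, both applied to the conditional normal distribution with parameters (4).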



In practice, the above equations are applied in simpler forms with a smaller
number of dimensions and constituent classes. Hence we shall investigate several
representative cases for the one-band two-class SMD. In this case, the area
proportion distribution is equivalent to the one-dimensional Beta distribution:

$$ f(a; \boldsymbol{\phi}) = \frac{\Gamma(\phi_1 + \phi_2)}{\Gamma(\phi_1)\Gamma(\phi_2)}\, a^{\phi_1 - 1} (1 - a)^{\phi_2 - 1} , \tag{8} $$

where $a_1 = a$ and $a_2 = 1 - a$. Then the moments of the mixel distribution can
be derived from (6) and (7) by substituting $\boldsymbol{\mu}_i \to \mu_i$ and $\Sigma_i \to \sigma_i^2$.



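$$ \mu_M = \frac{\phi_1 \mu_1 + \phi_2 \mu_2}{\phi_1 + \phi_2} , \tag{9} $$

$$ \sigma_M^2 = \frac{\phi_1(\phi_1 + 1)\sigma_1^2 + \phi_2(\phi_2 + 1)\sigma_2^2 + \frac{\phi_1 \phi_2}{\phi_1 + \phi_2}(\mu_2 - \mu_1)^2}{(\phi_1 + \phi_2)(\phi_1 + \phi_2 + 1)} . \tag{10} $$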



Example 1. The simplest SMD corresponds to the case when the PDF of
constituent class $C_i$ is the delta function $p(x|C_i) = \delta(x - \mu_i)$, and the PDF of
area proportions is the uniform distribution, $\phi_1 = \phi_2 = 1$. This is a special
case since we can analytically solve (5) to obtain the closed-form mixel
distribution:

$$ M(x) = \frac{1}{\mu_2 - \mu_1} , \qquad \mu_1 \le x \le \mu_2 , \tag{11} $$

where $\mu_2 > \mu_1$ is assumed without loss of generality. The mean and the variance
of this uniform distribution can be trivially calculated:

$$ \mu_M = \frac{\mu_1 + \mu_2}{2} , \qquad \sigma_M^2 = \frac{1}{12}(\mu_2 - \mu_1)^2 . \tag{12} $$

These results can also be obtained by substituting $\sigma_1^2 = \sigma_2^2 = 0$ and
$\phi_1 = \phi_2 = 1$ in (9) and (10), which shows that this is a special form of the SMD.
No.    $(\phi_1, \phi_2)$    Theoretical $(\mu_M, \sigma_M)$    Empirical $(\mu_M, \sigma_M)$
(1)    (0.5, 0.5)            (250.0, 107.8)                     (250.0, 107.5)
(2)    (1.0, 1.0)            (250.0, 88.5)                      (250.0, 88.5)
(3)    (1.5, 1.5)            (250.0, 77.1)                      (250.0, 77.3)
(4)    (1.5, 0.5)            (175.0, 78.7)                      (175.2, 78.5)
(5)    (0.5, 1.5)            (325.0, 76.2)                      (324.8, 76.0)

Fig. 1. The standard mixel distributions for five cases. The left figure illustrates
the shape of the SMD generated from the same normal distributions $N(100, 900)$ and
$N(400, 100)$, and five different sets of Beta distribution parameters shown on the right.
Compare the moments obtained theoretically and empirically $(\mu_M, \sigma_M)$.



Example 2. Let us consider a more “practical” case when the PDF of
constituent class $C_i$ is a univariate normal distribution $N(\mu_i, \sigma_i^2)$ and the PDF
of area proportions is the uniform distribution. Substituting the corresponding
parameters into (9) and (10), we obtain the moments as follows:

$$ \mu_M = \frac{\mu_1 + \mu_2}{2} , \qquad \sigma_M^2 = \frac{1}{3}(\sigma_1^2 + \sigma_2^2) + \frac{1}{12}(\mu_2 - \mu_1)^2 . \tag{13} $$

Here the variance consists of two terms: the first term represents the effect
of the variance of the constituent classes, while the second term depends on the
distance between the means. The latter is unique to the mixel distribution,
because it is caused not by random noise but by the mixture of different
constituents. This term also corresponds to (12).

Fig. 1 illustrates the shape of the one-band two-class SMD for five
representative cases. In spite of their different skewness and kurtosis, all the
SMDs similarly extend between the peaks of the constituent classes' PDFs. The
table on the right compares the theoretically derived moments (9) and (10) with
the empirical moments calculated from simulated data. The consistency in this
table supports the theoretical moments derived above.
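The comparison in the table can be repeated with a short sketch. The function names `smd_moments` and `simulate_smd`, the sample size and the random seed below are illustrative assumptions, not part of the original experiment; the theoretical part is the two-class closed form for the mean and variance, and the empirical part draws an area proportion from a Beta distribution and mixes two normal samples linearly.

```python
import math
import random


def smd_moments(phi1, phi2, mu1, var1, mu2, var2):
    """Closed-form mean and variance of the one-band two-class SMD."""
    phi = phi1 + phi2
    mean = (phi1 * mu1 + phi2 * mu2) / phi
    numerator = (phi1 * (phi1 + 1.0) * var1
                 + phi2 * (phi2 + 1.0) * var2
                 + phi1 * phi2 * (mu2 - mu1) ** 2 / phi)
    return mean, numerator / (phi * (phi + 1.0))


def simulate_smd(phi1, phi2, mu1, var1, mu2, var2, n=100000, seed=0):
    """Empirical mean and variance from simulated mixels (assumed setup)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        a = rng.betavariate(phi1, phi2)           # area proportion of class 1
        x1 = rng.gauss(mu1, math.sqrt(var1))      # sample of class 1
        x2 = rng.gauss(mu2, math.sqrt(var2))      # sample of class 2
        samples.append(a * x1 + (1.0 - a) * x2)   # linear mixture
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var


if __name__ == "__main__":
    for phi in [(0.5, 0.5), (1.0, 1.0), (1.5, 1.5), (1.5, 0.5), (0.5, 1.5)]:
        m, v = smd_moments(phi[0], phi[1], 100.0, 900.0, 400.0, 100.0)
        print(phi, round(m, 1), round(math.sqrt(v), 1))
```

With the class parameters of Fig. 1, the closed-form values agree with the theoretical column of the table to the printed precision; the simulated values depend on the seed and sample size.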

4 Mixture Density Estimation by EM Algorithm 

Finite Mixture Density and EM algorithm. The image classification method
used in this paper is based on a typical algorithm; that is, 1) estimate a finite
mixture density from the statistics of the observed image, and 2) classify each
pixel based on the Bayes decision rule. Here the mixture density model is as
follows:

$$ p(x|\psi) = \sum_{i=1}^{M} p(x|C_i, \psi_i)\, P(C_i) , \tag{14} $$

where M is the number of distributions, and $P(C_i)$ is the mixing parameter.
It should be emphasized that in our model M includes both the class-conditional
PDFs (normal distributions) and the SMD (not a normal distribution).
Then the problem is to search for the parameters in (14) that best describe the
statistics of the observed image.

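One of the standard iterative algorithms for solving this problem is the EM
algorithm [11]. Conceptually this algorithm consists of two steps, namely the
E-step and the M-step. The update rules of the parameters can be written as
follows:

$$ P^{t+1}(C_i) = \frac{1}{N} \sum_{k=1}^{N} \frac{P^t(C_i)\, p(x_k|C_i, \psi_i^t)}{p(x_k|\psi^t)} , \tag{15} $$

$$ \psi_i^{t+1} = \arg\max_{\psi_i} \sum_{k=1}^{N} \log p(x_k|C_i, \psi_i)\, \frac{P^t(C_i)\, p(x_k|C_i, \psi_i^t)}{p(x_k|\psi^t)} , \tag{16} $$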



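where the superscript t denotes the t-th iteration. In addition, more efficient
update rules are available for the parameters of the normal distribution:

$$ \boldsymbol{\mu}_i^{t+1} = \left( \sum_{k=1}^{N} x_k \frac{P^t(C_i)\, p(x_k|C_i, \psi_i^t)}{p(x_k|\psi^t)} \right) \Bigg/ \left( \sum_{k=1}^{N} \frac{P^t(C_i)\, p(x_k|C_i, \psi_i^t)}{p(x_k|\psi^t)} \right) , \tag{17} $$

$$ \Sigma_i^{t+1} = \left( \sum_{k=1}^{N} (x_k - \boldsymbol{\mu}_i^{t+1})(x_k - \boldsymbol{\mu}_i^{t+1})^T \frac{P^t(C_i)\, p(x_k|C_i, \psi_i^t)}{p(x_k|\psi^t)} \right) \Bigg/ \left( \sum_{k=1}^{N} \frac{P^t(C_i)\, p(x_k|C_i, \psi_i^t)}{p(x_k|\psi^t)} \right) . \tag{18} $$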



Approximation of the SMD. However, the efficient update rules (17) and (18),
or similar rules, cannot be applied to the SMD because it is not a normal
distribution and cannot be represented in closed form. The numerical
optimization in (16) and the integral in (5) require heavy computation, so the
incorporation of the SMD into the mixture density model (14) has not been a
practical choice.

In this paper, however, the results on the moments of the SMD play an
important role in realizing efficient computation of mixture density estimation.
The idea is very simple: to exploit the efficient update rules (17) and (18), we
replace the SMD with the normal distribution $N(\boldsymbol{\mu}_M, \Sigma_M)$; in other
words, we approximate the SMD with the normal distribution having identical
mean and covariance.

The modified algorithm can be described as follows. 1) All mixing parameters
are updated by (15). 2) The parameters of the class-conditional distributions are
updated by (17) and (18). 3) The parameters of the mixel distributions are updated
according to the theoretical moments (6) and (7) with the updated parameters
obtained above. Note that this algorithm does not merely increase the number of
normal distributions. The parameters $\boldsymbol{\mu}_M$ and $\Sigma_M$ are not freely chosen but
calculated so that they best describe the corresponding mixel distribution.
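The three steps of the modified algorithm can be sketched for the one-band two-class case. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the equal initial mixing parameters, the caller-supplied initial means and variances, and the fixed number of iterations are all hypothetical choices, and the Beta parameters are held fixed.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def smd_moments(phi, mu, var):
    """Two-class closed-form mean and variance of the SMD."""
    s = phi[0] + phi[1]
    mean = (phi[0] * mu[0] + phi[1] * mu[1]) / s
    num = (phi[0] * (phi[0] + 1.0) * var[0] + phi[1] * (phi[1] + 1.0) * var[1]
           + phi[0] * phi[1] * (mu[1] - mu[0]) ** 2 / s)
    return mean, num / (s * (s + 1.0))


def em_with_approximated_smd(xs, mu, var, phi=(1.0, 1.0), n_iter=50):
    """EM for two normal classes plus one mixel class approximated by a normal."""
    mu, var = list(mu), list(var)
    prior = [1.0 / 3.0] * 3                     # assumed equal initial mixing
    n = len(xs)
    for _ in range(n_iter):
        # Mixel component is tied to the current class parameters (step 3).
        comps = [(mu[0], var[0]), (mu[1], var[1]), smd_moments(phi, mu, var)]
        sum_w = [0.0] * 3
        sum_wx = [0.0] * 2
        sum_wxx = [0.0] * 2
        for x in xs:
            joint = [prior[i] * normal_pdf(x, *comps[i]) for i in range(3)]
            total = sum(joint) or 1e-300        # guard against underflow
            for i in range(3):
                w = joint[i] / total            # posterior weight of component i
                sum_w[i] += w
                if i < 2:
                    sum_wx[i] += w * x
                    sum_wxx[i] += w * x * x
        prior = [s / n for s in sum_w]          # step 1: mixing parameters
        for i in range(2):                      # step 2: class mean and variance
            mu[i] = sum_wx[i] / sum_w[i]
            var[i] = max(sum_wxx[i] / sum_w[i] - mu[i] ** 2, 1e-6)
    return prior, mu, var, smd_moments(phi, mu, var)
```

Because the mixel component has no free mean or variance, only its mixing parameter is estimated from the data, which is the point made in the note above about not simply adding another normal distribution.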

Fig. 2 compares the original and the approximated version of the SMD. This
approximation is reasonable because it is theoretically guaranteed that the
approximated version has the identical mean and covariance to the original version.



The Mixel Distribution and Its Application 








Fig. 2. Approximation of the SMD. Compare the original SMD (bold dotted line) to
the approximated normal distribution (bold solid line). Beta distribution parameters
for each case are identical to those used in Fig. 1.



Although some of the information is lost from the original SMD, especially
around regions of skewed peaks, experiments (not shown here) indicate that
unless the Beta distribution parameters are small, the deviation does not
seriously degrade classification performance.

5 Experimental Results 

We demonstrate the result of image classification on satellite images. The original
image, Fig. 3(c), contains sea and cloud regions, whose meteorological condition
strongly suggests that many mixels should be present in the image. Fig. 3(a)
and (b) show the results of mixture density estimation using the EM algorithm.

[Fig. 3, panel (a): Mixture density estimation with the approximated SMD
(Time = 0.1 seconds). Axis label: Temperature (Absolute).]

Fig. 3. The result of image classification of satellite imagery. (a) and (b) show the
results of mixture density estimation, (c) is the original image, and (d) is the result of
classification using (a), where white regions are cloud, black regions are sea and gray
regions are mixels of both constituents. In (a), $\phi_1 = \phi_2 = 0.5$ is fixed.
In terms of goodness-of-fit to the image histogram, (b) is superior to (a).
However, please note that this better result requires 20 to 100 times longer
computation time than (a), which uses the approximated SMD. Here “20 times
faster” may be a slightly exaggerated comparison because the previous method is
not efficient in terms of speed, but the proposed algorithm is in fact
sufficiently fast for practical use in problems like this. More importantly, the
Bayesian borders are almost the same in both cases, which suggests that
classification performance is nearly equal in both cases.

6 Conclusion 

Hence the proposed algorithm is much faster than, and almost as effective as,
our previous method. Thus we conclude that the approximation of the SMD using
the moments is a promising way to develop our classification algorithm into a
more sophisticated one that can be applied to a wide range of applications. With
the proposed classification method, the statistics of the image can be analyzed
more appropriately and accurately in the presence of mixels. Moreover, the
speed-up shown in this paper becomes more important when we increase the
dimension D to apply our methods to multi-spectral images. The method itself can
be easily extended to the multi-spectral case, so the extension in this direction
is practically the most important future work.



References 

[1] Tabatabai, A.J. and Mitchell, R. Edge Location to Subpixel Values in Digital
Imagery. IEEE Trans. Patt. Anal. Mach. Intell., Vol. 6, No. 2, pp. 188-201, 1984.

[2] Santago, P. and Gage, H.D. Quantification of MR Brain Images by Mixture
Density and Partial Volume Modeling. IEEE Trans. Med. Img., Vol. 12, No. 3,
pp. 566-574, 1993.

[3] Choi, H.S., Haynor, D.R., and Kim, Y. Partial Volume Tissue Classification of
Multichannel Magnetic Resonance Images - A Mixel Model. IEEE Trans. Med.
Img., Vol. 10, No. 3, pp. 395-408, 1991.

[4] Wang, F. Fuzzy Supervised Classification of Remote Sensing Images. IEEE Trans.
Geo. Remote Sens., Vol. 28, No. 2, pp. 194-201, 1990.

[5] Kent, J.T. and Mardia, K.V. Spatial Classification Using Fuzzy Membership
Models. IEEE Trans. Patt. Anal. Mach. Intell., Vol. 10, No. 5, pp. 659-671, 1988.

[6] Settle, J.J. and Drake, N.A. Linear Mixing and the Estimation of Ground Cover
Proportions. Int. J. Remote Sensing, Vol. 14, No. 6, pp. 1159-1177, 1993.

[7] Foody, G.M. Relating the Land-Cover Composition of Mixed Pixels to Artificial
Neural Network Classification Output. Photo. Eng. Rem. Sens., Vol. 62, No. 5,
pp. 491-499, 1996.

[8] Kitamoto, A. and Takagi, M. Image Classification Using Probabilistic Models
that Reflect the Internal Structure of Mixels. Patt. Anal. Appl., Vol. 2, No. 2,
pp. 31-43, 1999.

[9] Kitamoto, A. and Takagi, M. Image Classification Using a Stochastic Model that
Reflects the Internal Structure of Mixels. In Amin, A., Dori, D., Pudil, P., and
Freeman, H., editors. Advances in Pattern Recognition, Vol. 1451 of Lecture Notes
in Computer Science, pp. 630-639. Springer, 1998.







[10] Kitamoto, A. and Takagi, M. Area Proportion Distribution — Relationship with 
the Internal Structure of Mixels and its Application to Image Classification. Syst. 
Comp. Japan, Vol. 31, No. 5, pp. 57-76, 2000. 

[11] Redner, R.A. and Walker, H.F. Mixture Densities, Maximum Likelihood and the 
EM algorithm. SIAM Review, Vol. 26, No. 2, pp. 195-239, 1984. 



A Proofs 

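Corollary 1.

$$ \iint_A a_i\, f(\mathbf{a}; \boldsymbol{\phi})\, d\mathbf{a} = \frac{\phi_i}{\phi} , \tag{19} $$

$$ \iint_A a_i a_j\, f(\mathbf{a}; \boldsymbol{\phi})\, d\mathbf{a} = \begin{cases} \dfrac{\phi_i(\phi_i + 1)}{\phi(\phi + 1)} & (i = j) \\[2mm] \dfrac{\phi_i \phi_j}{\phi(\phi + 1)} & (i \ne j) \end{cases} \tag{20} $$

Proof. Because $f(\mathbf{a}; \boldsymbol{\phi})$ is a probability density, the following relationship holds:

$$ \iint_A \prod_{i=1}^{K} a_i^{\phi_i - 1}\, d\mathbf{a} = \frac{\prod_{i=1}^{K} \Gamma(\phi_i)}{\Gamma(\phi)} . \tag{21} $$

Then, applying the following property of the Gamma function

$$ \Gamma(n + 1) = n\Gamma(n) , \tag{22} $$

the integral in (19) can be calculated as follows:

$$ \iint_A a_j \prod_{i=1}^{K} a_i^{\phi_i - 1}\, d\mathbf{a} = \frac{\Gamma(\phi_j + 1) \prod_{i \ne j} \Gamma(\phi_i)}{\Gamma(\phi + 1)} = \frac{\phi_j}{\phi} \cdot \frac{\prod_{i=1}^{K} \Gamma(\phi_i)}{\Gamma(\phi)} . \tag{23} $$

Then, multiplying (23) by the normalizing constant of (3), it is easy to show that (19)
holds. We can also derive (20) in the same manner. □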

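Proof (of Proposition 1). The mean vector of the standard mixel distribution is the
first order central moment of $M(\mathbf{x})$, and it is calculated from the following formula:

$$ \boldsymbol{\mu}_M = \int_{-\infty}^{\infty} \mathbf{x}\, M(\mathbf{x})\, d\mathbf{x} = \int_{-\infty}^{\infty} \mathbf{x} \left\{ \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, N(\boldsymbol{\mu}_a, \Sigma_a)\, d\mathbf{a} \right\} d\mathbf{x} $$

It is not possible to directly solve this integral with respect to $\mathbf{a}$. However, if we change
the order of integration between $\mathbf{a}$ and $\mathbf{x}$ and first integrate with respect to $\mathbf{x}$, this
integral can be simplified as follows:

$$ \boldsymbol{\mu}_M = \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, d\mathbf{a} \int_{-\infty}^{\infty} \mathbf{x}\, N(\boldsymbol{\mu}_a, \Sigma_a)\, d\mathbf{x} = \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, d\mathbf{a} \int_{-\infty}^{\infty} (\mathbf{x} - \boldsymbol{\mu}_a + \boldsymbol{\mu}_a)\, N(\boldsymbol{\mu}_a, \Sigma_a)\, d\mathbf{x} \tag{24} $$

Since the normal distribution is even and $(\mathbf{x} - \boldsymbol{\mu}_a)$ is odd around $\boldsymbol{\mu}_a$,

$$ \int_{-\infty}^{\infty} (\mathbf{x} - \boldsymbol{\mu}_a)\, N(\boldsymbol{\mu}_a, \Sigma_a)\, d\mathbf{x} = 0 \tag{25} $$

Then (24) can be further simplified as follows:

$$ \boldsymbol{\mu}_M = \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, d\mathbf{a} \int_{-\infty}^{\infty} \boldsymbol{\mu}_a\, N(\boldsymbol{\mu}_a, \Sigma_a)\, d\mathbf{x} = \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, \boldsymbol{\mu}_a\, d\mathbf{a} = \iint_A \frac{\Gamma(\phi)}{\prod_{i=1}^{K} \Gamma(\phi_i)} \prod_{i=1}^{K} a_i^{\phi_i - 1} \left( \sum_{i=1}^{K} a_i \boldsymbol{\mu}_i \right) d\mathbf{a} \tag{26} $$

Finally, (26) is integrated with respect to $\mathbf{a}$ using Corollary 1:

$$ \boldsymbol{\mu}_M = \sum_{i=1}^{K} \frac{\phi_i}{\phi} \boldsymbol{\mu}_i = \frac{1}{\phi} \sum_{i=1}^{K} \phi_i \boldsymbol{\mu}_i \tag{27} $$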



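Proof (of Proposition 2). To obtain the covariance matrix, the second order moment
around the mean vector, basically the same technique is applied: first integrating the
equation with respect to $\mathbf{x}$, then with respect to $\mathbf{a}$. We solve the following equation:

$$ \Sigma_M = \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, d\mathbf{a} \int_{-\infty}^{\infty} (\mathbf{x} - \boldsymbol{\mu}_M)(\mathbf{x} - \boldsymbol{\mu}_M)^T N(\boldsymbol{\mu}_a, \Sigma_a)\, d\mathbf{x} \tag{28} $$

Applying (25) to (28), we obtain a simplified form:

$$ \Sigma_M = \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, d\mathbf{a} \int_{-\infty}^{\infty} (\mathbf{x} - \boldsymbol{\mu}_a + \boldsymbol{\mu}_a - \boldsymbol{\mu}_M)(\mathbf{x} - \boldsymbol{\mu}_a + \boldsymbol{\mu}_a - \boldsymbol{\mu}_M)^T N(\boldsymbol{\mu}_a, \Sigma_a)\, d\mathbf{x} $$

$$ = \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, \Sigma_a\, d\mathbf{a} + \iint_A f(\mathbf{a}; \boldsymbol{\phi}) (\boldsymbol{\mu}_a - \boldsymbol{\mu}_M)(\boldsymbol{\mu}_a - \boldsymbol{\mu}_M)^T d\mathbf{a} \tag{29} $$

Splitting (29) into two terms, the first term can be calculated as follows:

$$ \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, \Sigma_a\, d\mathbf{a} = \iint_A \frac{\Gamma(\phi)}{\prod_{i=1}^{K} \Gamma(\phi_i)} \prod_{i=1}^{K} a_i^{\phi_i - 1} \left( \sum_{i=1}^{K} a_i^2 \Sigma_i \right) d\mathbf{a} $$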
For the second term, we calculate the following relationship first:
Then, using Corollary 1, the second term can be simplified as follows:
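As a sketch of how the two terms evaluate, assume only the standard moments of the area proportion distribution, $E[a_i] = \phi_i/\phi$, $E[a_i^2] = \phi_i(\phi_i+1)/(\phi(\phi+1))$ and $E[a_i a_j] = \phi_i\phi_j/(\phi(\phi+1))$ for $i \ne j$. The notation $E[\cdot]$ is shorthand introduced here for integration against $f(\mathbf{a}; \boldsymbol{\phi})$ over $A$; it does not appear in the original text.

$$ \iint_A f(\mathbf{a}; \boldsymbol{\phi})\, \Sigma_a\, d\mathbf{a} = \sum_{i=1}^{K} E[a_i^2]\, \Sigma_i = \frac{1}{\phi(\phi+1)} \sum_{i=1}^{K} \phi_i(\phi_i+1)\, \Sigma_i $$

$$ \iint_A f(\mathbf{a}; \boldsymbol{\phi}) (\boldsymbol{\mu}_a - \boldsymbol{\mu}_M)(\boldsymbol{\mu}_a - \boldsymbol{\mu}_M)^T d\mathbf{a} = \sum_{i=1}^{K}\sum_{j=1}^{K} E[a_i a_j]\, \boldsymbol{\mu}_i \boldsymbol{\mu}_j^T - \boldsymbol{\mu}_M \boldsymbol{\mu}_M^T $$

$$ = \frac{1}{\phi(\phi+1)} \left[ \sum_{i=1}^{K} \phi_i \boldsymbol{\mu}_i \boldsymbol{\mu}_i^T + \sum_{i=1}^{K}\sum_{j=1}^{K} \phi_i \phi_j \boldsymbol{\mu}_i \boldsymbol{\mu}_j^T \right] - \frac{1}{\phi^2} \sum_{i=1}^{K}\sum_{j=1}^{K} \phi_i \phi_j \boldsymbol{\mu}_i \boldsymbol{\mu}_j^T $$

$$ = \frac{1}{\phi(\phi+1)} \left[ \sum_{i=1}^{K} \phi_i \boldsymbol{\mu}_i \boldsymbol{\mu}_i^T - \frac{1}{\phi} \sum_{i=1}^{K}\sum_{j=1}^{K} \phi_i \phi_j \boldsymbol{\mu}_i \boldsymbol{\mu}_j^T \right] $$

Adding the two terms gives the covariance matrix stated in Proposition 2. In the one-band two-class case the bracket reduces to $\phi_1\phi_2(\mu_2-\mu_1)^2/(\phi_1+\phi_2)$, which for $\phi_1 = \phi_2 = 1$ and zero class variances yields the uniform-distribution variance $(\mu_2-\mu_1)^2/12$ of Example 1.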
Abstract. This paper presents a new feature vector for statistical pattern
recognition based on the theory of moments, namely the Normalized Complex 
Moment Components (NCMC). The NCMC will be evaluated in the 
recognition of objects which share identical silhouettes using grayscale images 
and its performance will be compared with that of a commonly used moment 
based feature vector, the Hu moment invariants. The tolerance of the NCMC to 
random noise and the effect of using different orders of moments in its 
calculation will also be investigated. 



1 Introduction 

Geometric moments and moment based invariants have been widely used with 
success in the fields of image analysis and pattern recognition. A number of studies 
have been published by researchers which investigate their different properties, such 
as their applicability to different recognition problems ([1], [2], [3]), noise tolerance 
([4], [5]), computational complexity and simplification ([6], [7]) and hardware 
implementation ([8], [9]). 

This paper presents a new feature vector for statistical pattern recognition, the 
Normalized Complex Moment Components (NCMC), which is based on the theory of 
complex moments and maintains invariance with respect to translation, scale, 
illumination intensity and rotation transformations. The newly formed NCMC will be 
evaluated and compared with a commonly used moment based feature vector, the Hu 
Moment Invariants (HMI) ([10], [11], [12], [13]). The two feature vectors will be 
evaluated on a database of grayscale images which share identical silhouettes, relying 
on differences in their texture information for successful recognition. The effect of 
employing different classification measures in conjunction with the NCMC and HMI 
feature vectors will be investigated, as will the effect of employing moments of higher 
orders in the computation of both feature vectors. Finally, the noise tolerance of the 
NCMC will be compared against that of the HMI vector. 

The rest of this paper is organized as follows: Section 2 will give a brief 
introduction to geometric moments and their normalization procedures. Section 3 will 
provide an analysis of the NCMC in detail, followed, in Section 4, by a brief 
description of the Hu moment invariants feature vector. Finally, the comparative 
results of this study and conclusions will be presented in Sections 5 and 6 
respectively. 
2 Geometric Moments

For a discrete two-dimensional density distribution function g(x,y) (e.g. a
two-dimensional image) with x = 0, 1, ..., M and y = 0, 1, ..., N, the geometric
moments of order p+q are defined as

$$ m_{pq} = \sum_{x=0}^{M} \sum_{y=0}^{N} x^p\, y^q\, g(x,y) \tag{1} $$

Geometric moments are not directly suitable for object recognition since changes 
in the object’s position, orientation, scale (i.e. distance from optical sensor) and the 
intensity of illumination will cause their values to change. A set of normalization 
procedures needs to be applied so that invariant properties may be established. 

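Translation invariance (i.e. invariance with respect to the object’s position in the
image scene) is achieved by calculating the central moments $\mu_{pq}$. These can be
calculated directly from the ordinary moments using

$$ \mu_{pq} = \sum_{r=0}^{p} \sum_{s=0}^{q} \binom{p}{r} \binom{q}{s} (-\bar{x})^{r} (-\bar{y})^{s}\, m_{p-r,\,q-s} \quad \text{with} \quad \bar{x} = \frac{m_{10}}{m_{00}} ,\; \bar{y} = \frac{m_{01}}{m_{00}} \tag{2} $$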





Scale and illumination intensity invariance is achieved by calculating a new set of
moments $n_{pq}$ from the central moments using

$$ n_{pq} = \frac{\mu_{pq}}{\mu_{00}} \left( \frac{\mu_{00}}{\mu_{20} + \mu_{02}} \right)^{(p+q)/2} \tag{3} $$

Rotation invariance is slightly more difficult to achieve. The NCMC feature vector, 
which provides the desired rotation invariance properties, will be presented next. 



3 Normalized Complex Moment Components 



The NCMC is based on the theory of complex moments, first introduced by Abu- 
Mostafa and Psaltis in [14]. Abu-Mostafa and Psaltis demonstrate that only the 
magnitudes and not the phases of complex moments remain invariant under rotations, 
although both magnitudes and phases contain equally descriptive information 
regarding the image. The NCMC overcomes this by subjecting the complex moment 
phases to a suitable normalization procedure which makes them rotation invariant 
and, hence, utilizes all the available information regarding the image that is contained 
within the complex moments. 

The first step in the calculation of the NCMC is to calculate the complex moments
of the image. The complex moments of an image g(x,y) in terms of its geometric
moments are investigated in [15] and are given by

$$ C_{pq} = \sum_{r=0}^{p} \sum_{s=0}^{q} \binom{p}{r} \binom{q}{s}\, i^{\,p+q-r-s} (-1)^{q-s}\, m_{r+s,\,p+q-r-s} \tag{4} $$

If the normalized moments $n_{pq}$ of (3) are used in (4) instead of $m_{pq}$ then the complex
moments will be invariant to changes in translation, scale and illumination
intensity.

To demonstrate the phase normalization procedure of the NCMC the complex
moments must be expressed in terms of magnitude and phase, i.e.

$$ C_{pq} = |C_{pq}| \cdot e^{i\Phi_{pq}} \tag{5} $$



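Abu-Mostafa and Psaltis demonstrate in [14] that if the image is rotated by an
angle $\theta$ then the new set of complex moments is

$$ C'_{pq} = C_{pq} \cdot e^{i(p-q)\theta} = |C_{pq}| \cdot e^{i[\Phi_{pq} + (p-q)\theta]} \tag{6} $$

and so

$$ |C'_{pq}| = |C_{pq}| \tag{7} $$

$$ \Phi'_{pq} = \Phi_{pq} + (p-q)\theta \tag{8} $$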



Relations (7) and (8) indicate that only the magnitudes, and not the phases, of 
complex moments are invariant with respect to rotation. In the NCMC, the complex 
moment phases are normalized and made invariant with respect to rotation by using 
relation (8). 

For the calculation of the NCMC only the complex moments $C_{pq}$ with $p \ge q$ are
considered, since $C_{qp}$ is the complex conjugate of $C_{pq}$ (i.e. $\Phi_{qp} = -\Phi_{pq}$ and
$|C_{qp}| = |C_{pq}|$).

After the magnitudes of these complex moments have been calculated, the complex
moments are divided into different groups $G_k$. Each group contains only the
complex moments with p-q = k. For each group one complex moment $C_{rs}$ (r-s
= k) is selected to serve as a reference vector for that group. In this implementation,
the reference vector for each group is the complex moment of the highest order p+q
within the group.

These reference vectors are used for the calculation of the relative phases $R\Phi_{pq}$
of the complex moments $C_{pq}$, given by

$$ R\Phi_{pq} = \Phi_{rs} - \Phi_{pq} \quad (\text{with } p-q = r-s) \tag{9} $$

When the image is rotated by an angle $\theta$ then, using relation (8), the new relative
phases are given by

$$ R\Phi'_{pq} = \Phi'_{rs} - \Phi'_{pq} = \Phi_{rs} + (r-s)\theta - \Phi_{pq} - (p-q)\theta = R\Phi_{pq} \tag{10} $$

This shows that the relative phases defined in (9) are invariant under object
rotations. From (9) it becomes obvious that the reference vectors themselves have a
zero relative phase. It should also be noted that the complex moments of group $G_0$ (i.e.
$C_{22}$, $C_{33}$, ...) also have zero relative phases since by definition they have no imaginary



Statistical Pattern Recognition



component. The final step in the calculation of the NCMC is the transformation of the
magnitudes and relative phases to Cartesian (x,y) components which make up the final
rotation invariant vector, using

$$ x = |C_{pq}| \cdot \cos(R\Phi_{pq}) \quad \text{and} \quad y = |C_{pq}| \cdot \sin(R\Phi_{pq}) \tag{11} $$
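The steps of this section can be sketched end to end in plain Python. Everything below is an illustrative assumption rather than the authors' implementation: the function names, the representation of the image as a list of rows, the choice to include all complex moments with $p \ge q$ and $2 \le p+q \le$ `order`, and the use of centred pixel sums for the central moments. The normalization follows the form of (3) with exponent $(p+q)/2$.

```python
import cmath
import math
from math import comb


def normalized_moments(img, order):
    """Central moments normalized for scale and illumination intensity."""
    rows, cols = len(img), len(img[0])
    m00 = sum(sum(row) for row in img)
    xc = sum(x * img[x][y] for x in range(rows) for y in range(cols)) / m00
    yc = sum(y * img[x][y] for x in range(rows) for y in range(cols)) / m00

    def mu(p, q):
        # Central moment computed directly about the centroid.
        return sum((x - xc) ** p * (y - yc) ** q * img[x][y]
                   for x in range(rows) for y in range(cols))

    scale = m00 / (mu(2, 0) + mu(0, 2))
    return {(p, q): mu(p, q) / m00 * scale ** ((p + q) / 2.0)
            for p in range(order + 1) for q in range(order + 1 - p)}


def complex_moment(n, p, q):
    """Complex moment C_pq built from the moments in the dictionary n."""
    return sum(comb(p, r) * comb(q, s) * 1j ** (p + q - r - s) * (-1) ** (q - s)
               * n[(r + s, p + q - r - s)]
               for r in range(p + 1) for s in range(q + 1))


def ncmc(img, order):
    """Feature list of (x, y) pairs from magnitudes and relative phases."""
    n = normalized_moments(img, order)
    c = {(p, q): complex_moment(n, p, q)
         for p in range(order + 1) for q in range(p + 1)
         if 2 <= p + q <= order}
    features = []
    for k in range(order + 1):
        # Group G_k: moments with p - q = k, sorted by order p + q.
        group = sorted((pq for pq in c if pq[0] - pq[1] == k), key=sum)
        if not group:
            continue
        ref_phase = cmath.phase(c[group[-1]])      # highest-order reference
        for pq in group:
            rel = ref_phase - cmath.phase(c[pq])   # relative phase
            features.append(abs(c[pq]) * math.cos(rel))
            features.append(abs(c[pq]) * math.sin(rel))
    return features
```

A rotation by 90 degrees maps the pixel grid onto itself, so it gives a convenient exact check: the feature lists of an image and of its rotated copy should agree up to rounding, and multiplying all grey levels by a constant should leave them unchanged as well.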



4 Hu Moment Invariants 



The results presented later in this paper compare the NCMC vector with the
commonly used Hu Moment Invariants (HMI). This section briefly describes this
rotation invariant feature vector.

Based on the theory of algebraic invariants, Hu was the first to introduce moment
invariants ([10], [11]) for rotation invariant object recognition. The first step in the
calculation of the HMI is the calculation of the complex polynomials $I_{p-r,r}$ in terms of
the geometric moments of the image. The polynomials are given by

$$ I_{p-r,r} = \sum_{k=0}^{p-2r} \binom{p-2r}{k} (-i)^k \sum_{l=0}^{r} \binom{r}{l}\, m_{p-2l-k,\,2l+k} \tag{12} $$

with $p-2r \ge 0$. The Hu rotational invariants are then calculated by combining these
complex polynomials using

$$ I_{pr} \cdot I_{sq}^{\,t} + I_{rp} \cdot I_{qs}^{\,t} \tag{13} $$

with p-r = t(q-s). If the normalized moments $n_{pq}$ of (3) are used in (12) instead of
$m_{pq}$ then the resultant rotation invariants will also be invariant to translation, scale and
illumination intensity changes. Using moments up to the 3rd order Hu derived a set of
seven invariant features, historically known as the “Hu invariants”. These are not
included here but can be found in [11]. It is, however, possible to calculate Hu
invariants using moments of higher orders, but their calculation becomes increasingly
complex as the order increases. Wong et al. [16] describe an automatic approach for
the calculation of higher order moment invariants, while Li [17] describes a method
for deriving moment invariants of higher orders using the Fourier-Mellin transform.



5 Results 

The data set used in this study can be seen in Fig. 1. This set splits into six classes,
each class representing the same Printed Circuit Board with differences in particular
components and their placement. The spatial resolution of these images is 256x256
pixels and their grayscale resolution is 256 levels. Note that all the classes have
identical silhouettes, so the feature vectors will have to rely on differences in their
grayscale texture information to successfully identify them. Each of these images was
subjected to combinations of translation, rotation, uniform scale (between -20% and
+20%) and uniform illumination changes (between -33% and +33%), thus creating
400 images per class. For each class, 200 of those images were used for training and
200 for testing.




Fig. 1. Data set used for feature vector evaluation 




Fig. 2. An example of 20% noise addition (left) and subsequent median filtering (right) 



The robustness of both the NCMC and HMI vectors to image distortions was also
tested. Random ‘salt & pepper’ noise was added onto the test images, which were
subsequently filtered using a median filter, thus removing part of the noise but also
blurring the images. The level of noise used was 20%. Fig. 2 shows an example of
20% noise addition and subsequent filtering.

Geometric moments of the 3rd up to the 6th order were used in the implementation
of both the NCMC and HMI vectors. Also, two different classification measures were
used in conjunction with both feature vectors: a simple Euclidean distance classifier
and a more complex weighted Euclidean distance classifier.

For the Euclidean distance classifier the distance measure is given by

$$ d_i = \sqrt{\sum_{j=1}^{n} (x_j - v_{ij})^2} \tag{14} $$

where x is the test sample feature vector, $v_i$ is the i-th class prototype feature vector
(calculated during training) and n is the number of features. The test sample is
assigned to the class i which minimizes $d_i$.

For the weighted Euclidean distance classifier the distance measure is given by 




where x, v and n are the same as for the simple Euclidean classifier. The weighting
factor $w_j$ is calculated for each feature j and is the reciprocal of the mean of the m
standard deviations over the m different classes for the j-th feature, i.e.

$$ w_j = \frac{1}{\bar{s}_j} \quad \text{with} \quad \bar{s}_j = \frac{1}{m} \sum_{i=1}^{m} s_{ij} , \quad \text{where } m = \text{number of classes} $$

and $s_{ij}$ is the standard deviation of the j-th feature within class i.
This weighting factor has also been used by Cash and Hatamian in their OCR study
[3]. The test sample is assigned to the class i which minimizes $d_i$.
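The two minimum-distance classifiers can be sketched as follows. Several details here are assumptions made for illustration only: the class prototype is taken to be the mean training vector of the class, the standard deviation is the population form, and the weighted distance is taken as the Euclidean norm of the weighted differences $w_j (x_j - v_{ij})$, which is one plausible reading of the weighted measure. The function names are hypothetical.

```python
import math


def train(features, labels):
    """Return class labels, prototypes (class means) and feature weights."""
    classes = sorted(set(labels))
    n_feat = len(features[0])
    protos, stds = [], []
    for c in classes:
        rows = [f for f, l in zip(features, labels) if l == c]
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]
        std = [math.sqrt(sum((r[j] - mean[j]) ** 2 for r in rows) / len(rows))
               for j in range(n_feat)]
        protos.append(mean)
        stds.append(std)
    # Weight: reciprocal of the mean standard deviation over the classes.
    weights = [len(classes) / sum(s[j] for s in stds) for j in range(n_feat)]
    return classes, protos, weights


def classify(x, classes, protos, weights=None):
    """Assign x to the class whose prototype is nearest."""
    if weights is None:
        weights = [1.0] * len(x)                 # plain Euclidean distance
    dists = [math.sqrt(sum((w * (xj - vj)) ** 2
                           for w, xj, vj in zip(weights, x, v)))
             for v in protos]
    return classes[dists.index(min(dists))]
```

With features of very different spread, the two measures can disagree: a feature with a large within-class standard deviation dominates the plain Euclidean distance but is down-weighted in the weighted one.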

Tables 1, 2, 3 and 4 present comparative recognition results for the two feature 
vectors using different orders of moments and the two different classifiers. Tables 1 
and 2 present the recognition results when no noise is added onto the images, while 
Tables 3 and 4 present the recognition results for the two feature vectors for the 
distorted images, i.e. after 20% noise addition and filtering. 



Table 1. NCMC recognition results (in %). Noise-free images

NCMC                  6th order   5th order   4th order   3rd order
Euclidean             100.0       100.0       100.0       98.5
Weighted Euclidean    100.0       100.0       100.0       98.6



Table 2. HMI recognition results (in %). Noise-free images

HMI                   6th order   5th order   4th order   3rd order
Euclidean             96.1        [missing]   93.1        90.6
Weighted Euclidean    98.7        99.1        93.8        [missing]





It can be observed from Tables 1 and 2 that the NCMC delivers a higher
recognition performance than the HMI. For the NCMC the best recognition scores are
achieved using moments of the 4th order or higher (100% recognition), while 3rd order
moments produce slightly lower results. The HMI performs worse than the NCMC
and produces the best recognition scores using 5th order moments (99.1% recognition).
It is also interesting to note that, for the NCMC vector, the simple Euclidean classifier
delivers the same high performance as the more complicated weighted Euclidean
classifier. For the HMI, the difference in performance between the two classifiers is
not very large either, although the highest recognition scores are achieved with the
more complex classifier only.



Table 3. NCMC recognition results (in %). Distorted images

NCMC                  6th order   5th order   4th order   3rd order
Euclidean             99.0        99.3        98.8        94.8
Weighted Euclidean    99.3        100.0       98.7        94.2



Table 4. HMI recognition results (in %). Distorted images

HMI                   6th order   5th order   4th order   3rd order
Euclidean             72.9        95.8        83.3        77.5
Weighted Euclidean    87.7        96.0        86.2        77.5



From Tables 3 and 4 it becomes clear that the proposed NCMC feature vector has a
much higher tolerance to noise than the HMI feature vector. The NCMC is not greatly
affected by the addition of noise, especially for moments of the 4th order or higher,
and still delivers very high recognition results (up to 100% recognition for 5th order
moments). In contrast, the HMI is more heavily affected by the addition of noise and
produces much lower recognition results. For the HMI the highest recognition scores
are produced by the 5th order moments (96% recognition). It can also be noted that the
NCMC delivers a very high recognition performance with both classifiers. For the
HMI, however, the higher recognition scores are achieved only with the more
complex classifier.



6 Conclusions 

This paper has presented a new feature vector for statistical pattern recognition, the
Normalized Complex Moment Components. The NCMC was evaluated and
compared with the Hu moment invariants feature vector. The comparative results
indicate that the NCMC has superior performance and greater tolerance to noise than
the HMI vector. In addition, it was also found that the NCMC, used in
conjunction with a computationally and conceptually simple classification measure
such as the Euclidean distance, is capable of achieving performance levels that the
HMI can only approach by employing a more complex classification measure, a fact
which further illustrates the discriminating power of the proposed Normalized
Complex Moment Components feature vector.
Abstract. To match multiple views of a 3D scene, their relative geometric
distortions have to be taken into account. We assume the distortions
can be approximated by affine transformations. Images are matched by
combining an exhaustive and directed unconstrained Hooke-Jeeves search
for affine parameters, image pyramids being used to accelerate the
search. The parameters found for several matches are statistically
processed to relatively orient the images. Experiments with the RADIUS
multiple-view images show the feasibility of this approach.

Keywords: multiple-view stereo, image matching, affine geometry 



1 Introduction 

3D scene reconstruction from multiple views captured by initially uncalibrated 
cameras merges cameras calibration and scene reconstruction into a single itera- 
tive process mm- Usually views can be ordered so that neighbouring images 
cover almost the same part of the scene and have relative distortions of a known 
range (see, e.g., Figures 1-3). To begin the process, each image pair has to be
roughly oriented by matching corresponding points. The initial pairwise orien- 
tation allows, in principle, to roughly calibrate all the cameras. Then the rough 
calibration can be iteratively refined, along with the reconstruction of a 3D scene 
model. 

In most cases, characteristic points-of-interest (POI) in the images such as 
corners m are matched. But generally the POI detection is not stable enough 
with respect to geometric and photometric image distortions as to ensure that 
the corresponding POIs can be simultaneously detected in different images. One 
may expect the POIs in one image allow to only select several subimages, or 
prototypes, to be matched to another image taking account of relative image 
distortions specified by a camera model. 

2 Affine vs. Projective Image Distortions

If cameras are sufficiently far from a 3D scene, the projective pin-hole camera 
model can be approximated by the less complex affine model m- For simplicity, 



F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 540-^^^ 2000. 
© Springer-Verlag Berlin Heidelberg 2000



Initial Matching of Multiple-View Images 











Fig. 1. The top and next-to-top levels of the M28-M15 image pyramids. 




Fig. 2. The top and next-to-top levels of the M24-M25 image pyramids. 



let the photometric image distortions be uniform and excluded by equalising the
grey ranges of the images. Then the matching of a rectangular prototype $g_1$ to
a quadrangular area in the image $g_2$ specified by the affine transformation can
be guided by the mean square error (MSE) per pixel between the equalised grey
values:

$$ D(\mathbf{a}) = \frac{1}{|R|} \sum_{\mathbf{i} \in R} \left( g_1(\mathbf{i}) - g_2(\mathbf{f}(\mathbf{i}|\mathbf{a})) \right)^2 \tag{1} $$



Fig. 3. The top and next-to-top levels of the M29-M30 image pyramids. 



Here, R is a rectangular lattice supporting the prototype, and the pixel i = (x, y) 
with the column x and row y coordinates in the prototype corresponds to the 
pixel f(i|a) = (xa,j/a) in the image 52, obtained by the affine transformation 
with the six affine parameters a = [oi, . . . , oe]: Xa = «ix + 02?/ + 03 and j/a = 
a4X+a^y+ae. We assume the origin ( 0 , 0 ) of the (x, y)-coordinates coincides with 
the lattice centre. Parameters (01,05), (02,04), and (03,05) specify, respectively, 
scales, shears, and shifts of the image 32 with respect to the prototype. The grey 
level 32(3^65 2/a) can be found by a particular, e.g., nearest neighbour or bilinear 
interpolation of the image 52 • 

Parameters $\mathbf{a}^*$ ensuring the best match, that is, the minimum MSE in Eq. (1) 
between the prototype $g_1$ and the transformed image $g_2$: 

$$D(\mathbf{a}^*) = \min_{\mathbf{a}} D(\mathbf{a}) \qquad (2)$$

can be used to form an initial estimate of the camera model for the image $g_2$ in 
relation to the known camera model for the image $g_1$. This rough relative model 
can then be used as the first approximation to begin the search for a projective 
model yielding the best matches. 

3 Combined Search for Affine Parameters 

The globally optimum match of Eq. (2) can, in principle, be found by exhausting 
all the possible geometric distortions $\mathbf{a}$ of the image $g_2$. But the direct exhaustion 
of the six parameters for finding the minimum distance $D(\mathbf{a})$ is not computationally 
feasible. A more practical search (but for only a suboptimum match) is 
obtained by combining the exhaustion of a sparse grid of the relative shifts $a_3$ 
and $a_6$ with a direct Hooke-Jeeves unconstrained search from every grid 
position for all the parameters $\mathbf{a}$ that minimise the MSE. 

The Hooke-Jeeves search starts with the initial parameter values $a_1 = a_5 = 1.0$ 
and $a_2 = a_4 = 0.0$ and iterates two steps: the exploration and the directional 
search within a given range $[a_{i,\min}, a_{i,\max}]$, $i = 1, \ldots, 6$, of the parameter values. 
At each exploration step, the value of a single parameter $a_i$, $i = 1, 2, \ldots, 6$, is 
changed with a given fixed increment $\pm\delta_i$ to find whether the value of the MSE 
$D(\mathbf{a})$ can be decreased compared to its current value, given the fixed values of 
all other parameters. If the MSE decreases, the incremented value is substituted 
for the initial one, and the next parameter, $i' = 2, \ldots, 6, 1$, is explored until 
the MSE fails to decrease for all the parameters. Then the changes of the final 
parameters with respect to their starting values specify the possible direction of 
the MSE minimisation, and the directional search is performed by changing all 
the parameters simultaneously. These two steps are iterated at each grid position 
$t$ until the local minimum of the MSE, $D(\mathbf{a}^{[t]})$, is reached. The least value of 
$D(\mathbf{a}^{[t]})$ for the entire grid specifies the desired match. The obtained parameters 
$\mathbf{a}^{[t]}$ are refined by repeating the Hooke-Jeeves search from the position of the 
prototype given by the found parameters $a_3^{[t]}$ and $a_6^{[t]}$. 
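The combination of a sparse shift grid with the exploration and directional steps can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the clipping of parameters to their ranges, and the rule of repeating the directional move while it keeps reducing the cost are assumptions made for illustration; `cost` stands for any function returning the MSE for a parameter list.

```python
def hooke_jeeves(cost, start, steps, lower, upper, max_iter=1000):
    """Minimise cost(params) by exploratory moves with fixed increments
    followed by a directional move; returns (params, value)."""
    def clip(p):
        return [min(max(v, lo), hi) for v, lo, hi in zip(p, lower, upper)]

    def explore(base, base_val):
        # Cycle through the parameters until no single +/- increment helps.
        point, val = list(base), base_val
        improved = True
        while improved:
            improved = False
            for i, d in enumerate(steps):
                for delta in (d, -d):
                    trial = list(point)
                    trial[i] += delta
                    trial = clip(trial)
                    t_val = cost(trial)
                    if t_val < val:
                        point, val, improved = trial, t_val, True
                        break  # keep the new value, go to the next parameter
        return point, val

    current = clip(list(start))
    current_val = cost(current)
    for _ in range(max_iter):
        explored, explored_val = explore(current, current_val)
        if explored_val >= current_val:
            break  # local minimum for the given increments
        # Directional search: change all parameters simultaneously along the
        # accumulated change, for as long as the cost keeps decreasing.
        direction = [e - c for e, c in zip(explored, current)]
        current, current_val = explored, explored_val
        while True:
            trial = clip([c + d for c, d in zip(current, direction)])
            t_val = cost(trial)
            if t_val >= current_val:
                break
            current, current_val = trial, t_val
    return current, current_val

def combined_search(cost, shift_grid, steps, lower, upper):
    """Run the local search from every (a3, a6) grid position; keep the best."""
    best = None
    for a3, a6 in shift_grid:
        start = [1.0, 0.0, a3, 0.0, 1.0, a6]  # unit scales, zero shears
        params, val = hooke_jeeves(cost, start, steps, lower, upper)
        if best is None or val < best[1]:
            best = (params, val)
    return best

# Usage on a toy quadratic cost standing in for the MSE (hypothetical target).
target = [0.75, 0.125, 30.0, -0.125, 0.875, -15.0]
toy_cost = lambda p: sum((v - t) ** 2 for v, t in zip(p, target))
best_params, best_val = combined_search(
    toy_cost,
    shift_grid=[(0.0, 0.0), (15.0, -15.0), (30.0, -30.0)],
    steps=[0.125, 0.125, 1.0, 0.125, 0.125, 1.0],
    lower=[0.5, -0.5, -100.0, -0.5, 0.5, -100.0],
    upper=[1.5, 0.5, 100.0, 0.5, 1.5, 100.0],
)
```

The toy cost is unimodal, so every grid position reaches the same minimum; with a real multi-modal MSE the grid positions generally end in different local minima, which is why the least value over the grid is kept.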

Generally, the distance $D(\mathbf{a})$ in Eq. (1) is a multi-modal function of the affine 
parameters $\mathbf{a}$, and the search may stop far away from the desired best match. 
To partially overcome this drawback, we use a pyramidal image representation 
such as in Figures 1-3 to begin with the more stable low-resolution matching 
and refine the matching results at the higher-resolution levels of the pyramid. 

POIs detected in $g_1$ at the lowest-resolution, or top, level of the pyramid allow 
several large-size prototypes to be selected for matching to $g_2$ (see, e.g., Figures 
4-6). Ideally, the affine parameters for a large-size prototype coincide with the 
matching parameters for the entire images. Therefore the statistically processed 
parameters found for the lowest-resolution prototypes can be used to relatively 
orient the images at the top level of the pyramids. Then these parameters are 
transferred to the higher-resolution levels for a subsequent refinement. In this 
case the scale and shear parameters preserve their values, and only the shift 
parameters should be changed according to the actual image resolution. 
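The statistical processing and the transfer between pyramid levels can be illustrated with a short sketch. It assumes that the statistic is the component-wise median (as used in the experiments reported later) and that the resolution ratio between two levels is a single factor `scale`; both function names and the sample numbers are hypothetical.

```python
from statistics import median

def median_parameters(param_sets):
    """Component-wise median of several 6-parameter affine estimates."""
    return tuple(median(p[i] for p in param_sets) for i in range(6))

def transfer_to_level(a, scale):
    """Carry affine parameters to a level whose resolution is `scale` times higher.

    Scales (a1, a5) and shears (a2, a4) keep their values; only the shifts
    (a3, a6) are multiplied by the resolution ratio.
    """
    a1, a2, a3, a4, a5, a6 = a
    return (a1, a2, a3 * scale, a4, a5, a6 * scale)

# Usage with made-up estimates for three prototypes at the top level.
estimates = [
    (0.76, -0.32, 27.0, 0.31, 0.68, -22.0),
    (0.78, -0.30, 28.0, 0.33, 0.70, -23.0),
    (0.77, -0.10, 29.0, 0.32, 0.72, -21.0),
]
top_level = median_parameters(estimates)
next_level = transfer_to_level(top_level, 2.0)
```

The median is taken per parameter, so the resulting six values need not come from a single prototype.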

The above approach differs from the conventional matching with estimation 
of linear transformations [5] in that it does not linearize the MSE because the 
large-size prototypes are involved, and hence it does not build and use the normal 
equation matrix. The latter matrix usually is ill-conditioned because it depends 
on the image derivatives with respect to the affine parameters. Therefore these 
parameters are estimated more reliably by directly exploring the MSE to find a 
path to the minimum instead of guiding the search by an analytic gradient. 



4 Experiments with the RADIUS Images 

Three image pairs M15-M28, M24-M25, and M29-M30 from the RADIUS-M 
set [8] selected for experiments are shown in Figures 1-3. The next-to-top and 
[Figure panels omitted; recovered labels: D(a*) = 163.4, D(a*) = 51.7] 

Fig. 4. M28 prototypes (a-c) and their initial (d-f) and refined (g-i) M15 matches; 
ranges of $a_1$, $a_2$, $a_4$, $a_5$ for the g-i cases: [0.76, 0.78], [-0.32, -0.10], [0.31, 0.33], [0.68, 0.72]. 






[Figure panels omitted; recovered labels: (i) D(a*) = 67.4, (n) D(a*) = 45.4] 

Fig. 5. M24 prototypes (a-e) and their initial (f-j) and refined (k-o) M25 matches; 
ranges of $a_1$, $a_2$, $a_4$, $a_5$ for the k-o cases: [0.87, 1.00], [0.00, 0.02], [0.02, 0.05], [1.02, 1.12]. 






[Figure panels omitted; recovered labels: (f) D(a*) = 36.5, (k) D(a*) = 36.0, (j) D(a*) = 50.6, (o) D(a*) = 49.1] 



the top levels of the image pyramids are built by scaling the original images of 
size 1350 x 1035 down to 489 x 384 and 244 x 192 pixels. 

We use only three to five prototypes per image pair, and the median values a of 
the best-match parameters found for these prototypes specify the affine matrix A 
for the relative image orientation and initial camera calibration. Each prototype 
[Figure panels omitted; recovered labels: D(a*) = 64.9, D(a*) = 53.6, D(a*) = 64.2, D(a*) = 52.4, D(a*) = 50.7] 

Fig. 6. M29 prototypes (a-e) and their initial (f-j) and refined (k-o) M30 matches; 
ranges of $a_1$, $a_2$, $a_4$, $a_5$ for the k-o cases: [0.98, 1.02], [-0.02, 0.02], [-0.04, 0.04], [0.76, 0.84]. 



covers about 20% of the image area and was placed rather arbitrarily within 
the image, provided several detected POIs are uniformly distributed within the 
prototype [10]. The prototypes of size 110 x 100 taken from M28, M24, and M29 
at the top level of the pyramid and the initial and refined best matches of M15, 
M25, and M30 to these prototypes are shown in Figures 4-6. 

Here, the sparse grid of shifts to be exhausted has steps of 15 pixels in both 
directions, and shifts of the grid origin within the square 15 x 15 did not affect 
the final match. Although the photometric distortions of the images are non-uniform, 
the median values of the refined affine parameters $[a_1, \ldots, a_6]$ for the 
best matches, namely, 



             a1      a2       a3      a4      a5       a6 
M28-M15:    0.78   -0.30    154.9    0.33    0.70   -124.0 
M24-M25:    0.93    0.02     66.4    0.04    1.12   -172.5 
M29-M30:    1.00   -0.20     11.1   -0.02    0.80    145.5 



reflect the basic geometric distortions of these images (here, the parameters 
$a_3$ and $a_6$ are scaled to the initial image resolution 1350 x 1035 pixels). The 
parameters vary within narrow limits indicated in Figures 4-6. The images M28, 
M24, and M29, relatively oriented with respect to the images M15, M25, and M30 
using the above affine parameters, are presented in Figure 7. 

Refinement of the affine parameters at the next-to-top level of the image pyramids 
gives similar results that also remain within narrow limits. Prototypes of size 
200 x 180 taken from M28, M24, and M29 at the next-to-top level of the pyramid 
and the refined best matches of M15, M25, and M30 to these prototypes 
are shown in Figures 8-10. The median values of the parameters $[a_1, \ldots, a_6]$ for 
Fig. 7. Original images M15 (a), M25 (d), M30 (g) and the images M28 (b, c), M24 
(e, f), M29 (h, i) relatively oriented using the refined affine parameters estimated, 
respectively, at the top and next-to-top levels of the image pyramids. 



the best matches at this level are similar to the values found at the top level of 
the pyramids (see also Figure 7 c, f, i): 



             a1      a2       a3      a4      a5       a6 
M28-M15:    0.78   -0.27    157.4    0.31    0.71   -126.7 
M24-M25:    0.92    0.03     63.5    0.03    1.11   -172.5 
M29-M30:    1.02   -0.01     13.8   -0.02    0.80    148.2 



Figure 11 shows the overlaid original fields-of-view (FOV) of the M15 and 
M28 cameras in the 3D plane Z = 0 with the X- and Y-ranges [-10, 60] and 
[-20, 50], respectively, and the same FOVs after the affine transformation of the 
original projection matrix for the M28 camera so that the transformed projection 
matrix approximates the matrix for the M15 camera. Here, the affine parameters 
were estimated at the top level and the next-to-top level of the pyramid. 
Very similar results are obtained for the M24-M25 and M29-M30 cameras, too. 
Therefore the initial pairwise calibration of the cameras based on the affine 
distortion model seems to be sufficiently precise for the RADIUS images in spite 
of their non-uniform photometric distortions. 
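One plausible reading of the affine transformation of a projection matrix is a composition of the image-plane affine map with the 3 x 4 projection matrix. The sketch below illustrates only that reading; it is an assumption rather than a procedure stated in the text, the function names are hypothetical, and the shifts are assumed to be expressed in the same pixel coordinate frame as the projection matrix.

```python
import numpy as np

def affine_to_homogeneous(a):
    """3 x 3 homogeneous matrix for x' = a1*x + a2*y + a3, y' = a4*x + a5*y + a6."""
    a1, a2, a3, a4, a5, a6 = a
    return np.array([[a1, a2, a3], [a4, a5, a6], [0.0, 0.0, 1.0]])

def transform_projection(P, a):
    """Compose an image-plane affine map with a 3 x 4 projection matrix."""
    return affine_to_homogeneous(a) @ np.asarray(P, dtype=float)

def project(P, X):
    """Project a 3D point X = (X, Y, Z) to pixel coordinates."""
    u, v, w = np.asarray(P, dtype=float) @ np.append(np.asarray(X, dtype=float), 1.0)
    return u / w, v / w

# Usage with a toy projection matrix and made-up affine parameters.
P = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
P_transformed = transform_projection(P, (2.0, 0.0, 5.0, 0.0, 3.0, -1.0))
pixel = project(P_transformed, (1.0, 1.0, 0.0))
```

Projecting the corners of a ground-plane rectangle (Z = 0) with both matrices would give overlaid footprints of the kind compared in the field-of-view plots.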
[Figure panels omitted; recovered labels: D(a*) = 66.4, D(a*) = 45.0, D(a*) = 60.9] 

Fig. 8. Higher-resolution M28 prototypes (a-c) and their refined (d-f) M15 matches; 
ranges of $a_1$, $a_2$, $a_4$, $a_5$ for the d-f cases: [0.77, 0.79], [-0.27, -0.18], [0.31, 0.34], [0.68, 0.71]. 




[Figure panels omitted; recovered labels: D(a*) = 48.3, D(a*) = 34.5, D(a*) = 7.9] 

Fig. 9. Higher-resolution M24 prototypes (a-c) and their refined (d-f) M25 matches; 
ranges of $a_1$, $a_2$, $a_4$, $a_5$ for five matches: [0.87, 0.97], [0.01, 0.03], [0.03, 0.06], [1.09, 1.14]. 
[Figure panels omitted; recovered labels: D(a*) = 66.3, D(a*) = 51.0, D(a*) = 42.4] 

Fig. 10. Higher-resolution M29 prototypes (a-c) and their refined (d-f) M30 matches; 
ranges of $a_1$, $a_2$, $a_4$, $a_5$ for five matches: [0.98, 1.04], [-0.04, 0.01], [-0.05, 0.01], [0.74, 0.83]. 




Fig. 11. Original FOVs for the M15 and M28 cameras (a) and the same FOVs after 
the affine transformation of the projection matrix for the M28 camera to the projection 
matrix for the M15 camera using the affine parameters estimated at the top level (b) 
and the next-to-top level (c) of the pyramid. 



5 Conclusions 

These and other experiments show that large-size areas in the multiple-view 
images of a 3D terrain can in some cases be successfully matched by combining 
the exhaustive and directed unconstrained Hooke-Jeeves search for the (locally) 
minimum MSE, provided the relative image distortions can be closely approximated 
by the affine transformation. In our case, the Hooke-Jeeves method 
is preferred over other popular numerical methods for unconstrained optimisation 
(e.g., the Levenberg-Marquardt algorithm) because it involves no analytically 
computed derivatives of the MSE in the parameter 
space. For our large-size image prototypes, the Jacobian matrix of the first 
derivatives of the MSE to be used in the Levenberg-Marquardt algorithm is usually 
ill-conditioned and too noisy to correctly guide the search. 

The proposed approach to image matching has a moderate computational 
complexity so that in principle it can be used at the initial stage of the 
uncalibrated multiple-view terrain reconstruction. This approach permits us to 
simultaneously detect a small set of the initial corresponding points and form 
the relative affine camera models that roughly describe the pairwise image 
orientation. These models can then be used as the first approximation for estimating 
the relative projective camera models to begin the subsequent iterative process 
of refining the camera calibration and reconstructing the desired 3D scene model. 

This approach exploits almost no prior information about a 3D scene, except 
for a range of the relative shifts between the corresponding points. But the 
images to be matched are assumed to contain a sufficient number of POIs for 
choosing the prototypes. Also, to make this approach practicable the problem of 
automatic choice of the appropriate prototypes, given a particular spatial scatter 
of the POIs, has to be solved. 

References 

1. Dennis, J.E., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization 
and Nonlinear Equations. Prentice-Hall, Englewood Cliffs (1983) 

2. Forstner, W.: Image matching. In: Haralick, R.M., Shapiro, L.G.: Computer and 
Robot Vision. Vol. 2. Addison-Wesley, Reading (1993) 289-378 

3. Himmelblau, D.M.: Applied Nonlinear Programming. McGraw-Hill Book, New 
York (1972) 

4. Koch, R., van Gool, L. (eds.): 3D Structure from Multiple Images of Large-Scale 
Environments. Lecture Notes in Computer Science, Vol. 1506. Springer-Verlag, Berlin 
Heidelberg New York (1998) 

5. Koenderink, J., van Doorn, A.: Affine structure from motion. J. Optical Soc. 
America 8 (1991) 377-382 

6. Maybank, S.J., Faugeras, O.: A theory of self-calibration of a moving camera. Int. 
J. Computer Vision 8 (1992) 123-152 

7. Pollefeys, M., Koch, R., van Gool, L.: Self-calibration and metric reconstruction in 
spite of varying and unknown internal camera parameters. Int. J. Computer Vision 
32 (1999) 7-25 

8. RADIUS model board imagery and ground truth (CD-ROM). Intelligent Machines 
Lab., University of Washington, Seattle (1996) 

9. Shapiro, L.S.: Affine Analysis of Image Sequences. Cambridge University Press 
(1995) 

10. Zhang, J.Q., Gimel’farb, G.: On detecting points-of-interest for relative orientation 
of stereo images. In: D. Pairman, H. North (eds.) Proc. Image and Vision Computing 
New Zealand 1999. Univ. of Canterbury, Christchurch, New Zealand, 30th-31st 
August 1999. Landcare Research, Lincoln (1999) 51-66 

11. Zhang, Z., Deriche, R., Faugeras, O., Luong, Q.T.: A robust technique for matching 
two uncalibrated images through the recovery of the unknown epipolar geometry. 
Artificial Intelligence J. 78 (1995) 87-119 




Local Discriminant Regions Using Support 
Vector Machines for Object Recognition 



David Guillamet and Jordi Vitria 

Centre de Visio per Computador-Dept. Informatica, Universitat Autonoma de 
Barcelona, 08193 Bellaterra, Barcelona, Spain 
Tel. +34 93-581 30 73 Fax. +34 93-581 16 70 
{davidg, jordi}@cvc.uab.es 

WWW home page: http://www.cvc.uab.es/~davidg 



Abstract. Visual object recognition is a difficult task when we consider 
non-controlled environments. In order to manage problems like scale, 
viewing point or occlusions, local representations of objects have been 
proposed in the literature. In this paper, we develop a novel approach 
to automatically choose which samples are the most discriminant ones 
among all the possible local windows of a set of objects. The use of Support 
Vector Machines for this task has allowed the management of 
high-dimensional data in a robust and well-founded way. Our approach is tested 
on a real problem: the recognition of informative panels. 

Keywords: Support Vector Machines, Local Appearance, Computer Vision, 
Object Recognition. 



1 Introduction 

Visual recognition of objects is one of the most challenging problems in computer 
vision and artificial intelligence. Historically, there has been an evolution in 
recognition research from 3D geometry to 2D image analysis. Early approaches 
to object recognition were based on 3D geometry extraction, but the process 
of extracting geometrical models of the viewed objects leads to a difficult 
problem and fragile solutions. Furthermore, these 3D geometry based techniques 
can be made to work in a controlled environment, but their application to real 
environments generates several problems. 

An alternative to 3D reconstruction is to remain in the 2D image space, working 
with measurements of the object appearance. Turk and Pentland used 
subspace methods to describe face patterns with a lower-dimensional space than 
the image space. The appearance of a face is the combination of its shape, 
reflectance properties, pose in the scene and illumination conditions, and they use 
the Principal Component Analysis (PCA) technique to obtain a reduced space. 
Murase and Nayar extended this idea using different instances of an object 
captured in a wide range of conditions (several viewpoints and illumination 
conditions) and used them to represent the object as a trajectory in the PCA space. 
Recognition is achieved by finding the trajectory that is closest to the projection 
of an input image in the PCA space formed by all objects. Black and Jepson 
have addressed the problem of partial occlusion by using robust estimation 
techniques in conjunction with PCA based projections. However, PCA based 
techniques suffer from several difficulties. Mainly, an image projection to a PCA 
based space depends on the precise position of the relevant objects, on the 
intensity and shape of background zones, and on the intensity and color of illumination. 
Given that the PCA technique treats its inputs (in our particular case, images) in a 
global manner, the relevant objects must be detected, segmented and normalized 
to manage them in the same way. This problem leads to a difficult process that 
can be unsolvable in certain cases. 

PCA analysis can be done on different image representation data. Hancock 
found that the result of applying a PCA projection over a set of natural 
images was nearly the same as a set of Gaussian derivative filters. Rao and 
Ballard confirmed the results of Hancock with an extensive collection of 
images containing equal proportions of natural and man-made stimuli. Thus, the 
Gaussian derivative filters are natural basis functions useful for general-purpose 
object recognition, and objects can be expressed as a set of reduced response 
vectors obtained by applying these filters. 

Current research on visual recognition of objects is focused on the identification 
of physical objects from arbitrary viewpoints, under arbitrary lighting 
conditions, and situated in an undetermined scene with possible occlusions. 
The effect of occlusions and different backgrounds can be minimized 
by using local measurements instead of global treatments. Some recent approaches 
focus on the fact that an object can be divided into small windows 
but only a subset of them is necessary to identify the object. The basic idea is 
to process an object to obtain a set of reliable points (those that can contain 
reliable information) and to select some of them to get a discriminant subset. 
Ohba and Ikeuchi use a measure of trackability to obtain an initial set of 
candidate points that are reduced with an eigenspace projection. Schmid and 
Mohr use the well-known Harris detector to obtain their candidate points. 
However, some authors apply their descriptors on a predefined 
grid instead of on a set of selected interest points. This criterion is justified 
by the fact that objects captured in non-controlled environments manifest some 
instabilities in the procedure of extracting interest points. 

Our approach is similar in spirit to the work of Ohba and Ikeuchi, who 
extract a subset of local windows of an object to identify it. We select a subset 
of local windows in a different way: using the Support Vector Machines (SVM) 
technique that provides an optimal separating hyperplane between two different 
classes with an intrinsic distance notion that can be exploited. Ikeuchi’s method 
does not depend on the classification task given that a threshold distance 
must be defined in order to reject similar local windows. The user must tune 
this threshold according to the nature of the objects, i.e., if the database is composed of 
similar objects, the threshold must be different from the one considered with a 
database of several kinds of objects. Our approach does not need a tuning parameter 
that reflects the possible similarities of the training objects given that the 
SVM technique can extract and detect those training points that are conflictive 
(support vectors) without any external help. We have chosen a reduced set of 
objects taken in a non-controlled environment to test the basis of our approach. 
A set of local windows has been extracted from each object in order to minimize 
the future effects of occlusions and possible background problems, and a sorted 
list of the local windows has been built to show the discriminant information 
that each local window contains. Objects have been normalized to a constant 
image size in order to consider their local windows in the same way. 

2 Support Vector Machines 

A two-class problem can be defined as: 

$$(x_1, y_1), \ldots, (x_n, y_n), \quad x \in \mathbb{R}^d, \; y \in \{+1, -1\} \qquad (1)$$

where each example has an assigned value (+1 or -1) depending on the class 
to which it belongs. In this particular case, the SVM technique can be used to seek 
an optimal separating hyperplane $D(x)$ defined as: 

$$D(x) = (w \cdot x) + w_0 \qquad (2)$$

where $w_0$ is a threshold value and $w$ is a weight vector. Figure 1 shows a 
graphic representation of an optimal hyperplane. 




Fig. 1. Optimal separating hyperplane versus different possible hyperplanes. The optimal 
hyperplane is the hyperplane that defines a maximum margin between the support 
vectors. Support vectors are indicated in gray values. 



Depending on the number of support vectors, Vapnik states a generalization 
upper bound: 

$$E[\text{Error}] \le \frac{E[\text{Number of support vectors}]}{n - 1} \qquad (3)$$

This expression states that the expectation of the number of support vectors 
obtained during training on a training set of size $n$, divided by $n - 1$, is an upper 
bound on the expected probability of test error. 

The difficulty of separating a certain class can be minimized if this class is 
mapped to a higher dimensional space where the SVM technique can improve 
its separability property. Such a mapping can be done without affecting the 
complexity of the SVM decision boundaries given that the SVM technique is independent of 
the new space dimensionality, which can be very large (or even infinite). SVM 
optimization takes advantage of the fact that all the operations that have to be 
carried out in such a new high dimensional space (feature space) can be done in 
the input space via the evaluation of a kernel function $k(x, y)$ defined by the 
inner product between support vectors and vectors in the input space: 

$$k(x, y) = (\Phi(x) \cdot \Phi(y)) \qquad (4)$$

where $\Phi(x)$ is a mapping function that maps an input vector $x$ to a feature space 
vector. Different kernels can be used: 

- Linear kernel: It is a simple inner product in the input space: 

$$k(x, y) = x \cdot y \qquad (5)$$

- Polynomial kernel of degree d: The optimal hyperplane will be defined 
as a polynomial expression: 

$$k(x, y) = [(x \cdot y) + 1]^d \qquad (6)$$

- RBF kernel: The optimal hyperplane will be defined as a radial basis 
function: 

$$k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \qquad (7)$$
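The three kernels can be written directly as functions of two input vectors. This is an illustrative sketch only: the function names are hypothetical, the polynomial offset of 1 follows the usual form of this kernel, and the Gaussian normalisation by 2σ² is an assumption, since other conventions (for example division by σ²) are also in common use.

```python
import math

def dot(x, y):
    """Inner product of two equally long sequences of numbers."""
    return sum(xi * yi for xi, yi in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, d):
    """Polynomial kernel of degree d with a unit offset (assumed form)."""
    return (dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma):
    """Gaussian radial basis function kernel; the 2*sigma**2 scaling is assumed."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Usage on two small vectors.
u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v, 2)
k_rbf_same = rbf_kernel(u, u, 0.5)
```

Each function evaluates the feature-space inner product without constructing the mapped vectors, which is the property the text relies on.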

Expression (2) can be expressed in terms of support vectors and a kernel function 
as: 

$$D(x) = \sum_{i=1}^{n} \alpha_i^* y_i k(x_i, x) + w_0 \qquad (8)$$

where $\alpha_i^*$ are the Lagrangian coefficients of the quadratic optimization. 

2.1 Multiclass Classification Using Support Vector Machines 

Dealing with a k-class classification problem, a set of binary classifiers $f^1, \ldots, f^k$ 
has to be constructed, each trained to separate one class from the rest, and they 
are combined into a multi-class classifier according to the maximal output 
obtained by expression (8), i.e., by taking: 

$$\arg\max_{j = 1, \ldots, k} D^j(x), \quad \text{where } D^j(x) = \sum_{i=1}^{n} y_i \alpha_i^j k(x, x_i) + w_0^j \qquad (9)$$
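The kernel form of the decision function and the one-against-the-rest rule can be sketched as below. The data layout (a tuple of support vectors, coefficients, labels and threshold per classifier) and the function names are hypothetical, and the coefficients in the usage example are hand-made values rather than the result of any training.

```python
def linear(x, y):
    """Linear kernel, used here only to keep the example easy to verify."""
    return sum(xi * yi for xi, yi in zip(x, y))

def decision(x, classifier, kernel):
    """D(x) = sum_i alpha_i * y_i * k(x_i, x) + w0 for one binary classifier.

    classifier = (support_vectors, alphas, labels, w0).
    """
    vectors, alphas, labels, w0 = classifier
    return sum(a * y * kernel(xi, x) for xi, a, y in zip(vectors, alphas, labels)) + w0

def classify(x, classifiers, kernel):
    """Index of the binary classifier with the maximal output."""
    outputs = [decision(x, c, kernel) for c in classifiers]
    return max(range(len(outputs)), key=outputs.__getitem__)

# Usage: two hand-made classifiers whose outputs equal the first and the
# second coordinate of the input, respectively.
classifiers = [
    ([[1.0, 0.0]], [1.0], [+1], 0.0),
    ([[0.0, 1.0]], [1.0], [+1], 0.0),
]
first = classify([2.0, 1.0], classifiers, linear)
second = classify([0.0, 3.0], classifiers, linear)
```

Only the support vectors enter the sum, because the coefficients of all other training points are zero.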
2.2 Hyperplane Distances 

Our approach is based on the fact that each training point has a relative distance 
to the optimal calculated hyperplane. The optimal hyperplane is defined by a 
set of support vectors, which are the closest and the most conflictive training 
points. Thus, by extracting the most distant training points, we can obtain those 
points with a low probability of being misclassified. 

Figure 2 schematizes our approach. Given a distribution of training points 
that have to be separated using Support Vector Machines, an optimal hyperplane 
is calculated. Figure 2(a) shows a complex distribution that is not totally 
separable, with a conflictive region where points of different classes reside. In 
this particular case, applying a linear kernel to obtain an optimal separating 
hyperplane implies that 10 training vectors are considered support vectors (as 
shown in Figure 2(b)). Given that support vectors are conflictive points, we do 
not consider them as relevant training points and we sort the rest of the training 
points depending on their distance to the optimal hyperplane. 
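The sorting step can be sketched as follows, assuming that the decision value D(x) of every training point and the indices of the support vectors are already available from a trained classifier. The function name and the input layout are hypothetical. The absolute decision value is used as the distance, since for a fixed hyperplane it differs from the geometric distance only by a constant factor and therefore gives the same ordering.

```python
def rank_by_distance(decision_values, support_indices):
    """Indices of non-support training points, most distant from the hyperplane first."""
    support = set(support_indices)
    candidates = [i for i in range(len(decision_values)) if i not in support]
    return sorted(candidates, key=lambda i: abs(decision_values[i]), reverse=True)

# Usage with made-up decision values; points 2 and 4 are taken as support vectors.
values = [0.9, -3.0, 1.0, 2.5, -1.0]
ranking = rank_by_distance(values, support_indices=[2, 4])
```

The first index in the ranking is the point regarded as the most important one, in the sense of Fig. 2(c).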




Fig. 2. (a) Original Distribution, (b) Optimal Linear Hyperplane and Support Vectors, 
(c) Training points that are not support vectors sorted depending on their distance to 
the optimal hyperplane. The most distant point is the most important point. 



3 Experimental Results 

The main aim of our approach is the extraction of the most discriminant local 
windows belonging to a set of objects. A local window division of an object is 
justified by the fact that background and occlusion influences will be minimized. 
However, this local window division generates a very large database of 
local windows and requires a prohibitive amount of memory to store all of them. 
The basic idea is that not all the local windows of an object are necessary to 
recover the identity of an object given that most of them can be redundant or 
may not contain discriminant information. In our case, we divide each object into 
a set of local windows (different divisions are considered) and we sort 
all of them by considering their discriminant information. Depending on the 
final application and the memory space, the final user must select how many 
discriminant local windows to use. 

In our particular case, we have chosen a reduced set of 8 panels situated on 
the walls of our building, captured from different viewpoints and under different 
lighting conditions. We have done a panel mapping operation in order to obtain a 
set of panels with a known size, and we have divided each panel using a predefined 
grid [3]. This grid defines a set of interesting points where we have applied a set of 
descriptors. In our case, we have decided to use a set of Gaussian derivative filters as local 
image descriptors given that this image representation is specially suited to 
visual discrimination. Figure 3 shows all the 8 different panels used in our 
experiments. 




(a) Panel 1 (b) Panel 2 (c) Panel 3 (d) Panel 4 




(e) Panel 5 (f) Panel 6 (g) Panel 7 (h) Panel 8 

Fig. 3. 8 different panels mapped to a standard window size of 175 x 250 pixels. 



All panels have been randomly divided to make up the training and testing 
sets. The training and testing sets are composed of 29 different instances of each 
panel. We have considered 7 different scales for our Gaussian filters and used up 
to third order derivatives. So, each interesting point has a response vector of 70 
dimensions. The Gaussian window size applied in all the following experiments 
is constant (37 x 37), and different sizes of panels are considered in order to study 
how the neighborhood affects the results. Before obtaining a response vector, we have applied 
an illumination intensity normalization that consists of subtracting from each 
local window its gray value mean and considering its variance as: 

$$\hat{I}(x, y) = \frac{I(x, y) - \mu}{\sigma} \qquad (10)$$

where $I(x, y)$ is the gray value of the local window, $\mu$ is the region intensity mean and $\sigma$ its variance. 
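Both the normalization and the size of the response vector can be checked with a short sketch. Two assumptions are made here and are not stated in the text: the spread used in the normalization is taken to be the standard deviation of the window, and the 70 dimensions are explained by counting, for each of the 7 scales, all partial derivatives of orders 0 to 3 in two dimensions (1 + 2 + 3 + 4 = 10 filters). The function names are hypothetical.

```python
import numpy as np

def normalise_window(window):
    """Subtract the mean gray value of a window and divide by its spread."""
    window = np.asarray(window, dtype=float)
    mu = window.mean()
    sigma = window.std()  # assumption: standard deviation is used as the spread
    if sigma == 0.0:
        return np.zeros_like(window)  # a flat window carries no contrast
    return (window - mu) / sigma

def response_dimension(n_scales, max_order):
    """Number of 2D Gaussian derivative filters of orders 0..max_order per scale,
    multiplied by the number of scales."""
    per_scale = sum(order + 1 for order in range(max_order + 1))
    return n_scales * per_scale

# Usage on a tiny window and on the filter bank size used in the experiments.
normalised = normalise_window([[1.0, 2.0], [3.0, 4.0]])
dimension = response_dimension(n_scales=7, max_order=3)
```

After normalization a window has zero mean and unit spread, so a uniform change of brightness or contrast leaves the response vector unchanged.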

3.1 Semiglobal Experiment 

Each panel is divided into 15 regions (3 horizontally and 5 vertically), obtaining a 
training and testing set of 3480 vectors. Panels are resampled to a window size 
of 85 x 125 pixels. Different kernels have been trained to separate each panel 
from the rest, obtaining the results shown in Table 1. 



Table 1. Results obtained dividing each panel into 15 regions (8 panels with 29 instances 
divided into 15 regions = 3480 different response vectors). Different kernels have been 
tested, obtaining a good performance using an RBF kernel with σ = 0.5, with a test error rate 
of 4.87 %. Each table box has 3 different numbers: the number of support vectors 
obtained, the number of misclassified training vectors and the number of misclassified 
testing vectors. The last column shows the final test error obtained considering the maximal 
output obtained from the eight classifiers (see expression (9)). 



Kernel              Analyzed feature  Panel 1  Panel 2  Panel 3  Panel 4  Panel 5  Panel 6  Panel 7  Panel 8  Error rate 
Linear              # SVs                 372      400      530      895      648      825      768      875     10.41 % 
                    Train Error           211      206      226      679      461      600      552      650 
                    Test Error            241      187      194      524      320      502      429      520 
Polynomial, d = 2   # SVs                 128       66      171      304      123      306      301      246      6.25 % 
                    Train Error           116        0      149      313       47      425      378      213 
                    Test Error            166       35      189      327      122      402      365      258 
Polynomial, d = 3   # SVs                 127       75      145      277      127      265      277      238      7.53 % 
                    Train Error           227        0      227      359        8      224      355      242 
                    Test Error            316       28      275      392       98      270      370      298 
RBF, σ = 0.0005     # SVs                 481      438      638      856      772      888      887      912      9.88 % 
                    Train Error           189       94      229      309      287      412      427      391 
                    Test Error            224       93      200      367      266      408      412      402 
RBF, σ = 0.005      # SVs                 411      158      521      841      498      828      716      868      6.03 % 
                    Train Error           116       19      166      246      117      377      270      332 
                    Test Error            166       15      170      315      116      378      298      350 
RBF, σ = 0.05       # SVs                 230       78      313      525      227      554      505      450      5.39 % 
                    Train Error            28        2       37      122       15       93      122       90 
                    Test Error             85       17      108      208       68      161      142      150 
RBF, σ = 0.5        # SVs                 133       97      155      240      179      205      249      227      4.87 % 
                    Train Error             0        0        0       12        0        1        8        9 
                    Test Error             63       28       74      131       64      112      130      143 



Choosing the best kernel (the one with fewer support vectors and a low error 
rate), in this case the RBF kernel with σ = 0.5, we have applied the idea 
mentioned in Section 2.2 to extract the most discriminant regions of each panel 
(considering the distance of each region to the optimal hyperplane). Figure 4 
shows the sorted list of the discriminant zones from each panel according to the 
distance of each region to the hyperplane obtained with the RBF kernel with σ = 0.5. 
(a) Sorted regions from panel 1. (b) Sorted regions from panel 2. 
(c) Sorted regions from panel 3. (d) Sorted regions from panel 4. 
(e) Sorted regions from panel 5. (f) Sorted regions from panel 6. 
(g) Sorted regions from panel 7. (h) Sorted regions from panel 8. 

Fig. 4. Sorted regions according to the best kernel in Table 1. It can be 
seen that the first discriminant regions of all the panels are those that belong to central 
zones, and the last ones are those that are homogeneous zones (bright zones that are 
conflictive) or belong to the panel regions where the title is (given that all the panel 
titles have a similar tonality). 



3.2 Semilocal Experiment 

In this experiment, each panel is divided into 45 regions (5 horizontally and 9 
vertically), obtaining a training and testing set of 10440 vectors. Panels are 
resampled to a window size of 125 x 175. In this case, each window contains less 
information about its neighborhood than in the previous experiment given that the 
panel size is bigger than before. Different kernels have been trained to separate 
each panel from the rest and, for lack of space, Table 2 only shows the best 
one. 

Table 2 shows that the final error rate and the number of support vectors 
increase given that, in this particular case, each region contains less information 
than in the previous experiment. Having more local regions, white and 
homogeneous zones increase because each local window considers a more local 
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Table 2. Results obtained dividing each panel in 45 regions (8 x 45 x 29 = 10440 
different response vectors). In this table, only the best kernel result is shown. 



Kernel 


Analyzed 

Feature 


Panel 1 


Panel 2 


Panel 3 


Panel 4 


Panel 5 


Panel 6 


Panel 7 


Panel 8 


Error 

rate 




# SVs 


981 


368 


1291 


1392 


1232 


792 


601 


912 




RBF 


Train Error 


218 


16 


356 


871 


812 


281 


210 


304 


6.8 % 


(T ^ 0.5 


Test Error 


387 


140 


509 


903 


874 


393 


415 


544 





neighborhood. However, discriminant zones are concentrated in the same regions 
that in figure O- 



3.3 Local Experiment 

In this experiment, each panel is divided into 153 regions (9 horizontally and
17 vertically), giving a training and testing set of 12240 vectors (only 10
instances of each panel are considered). The panel window size considered in
this case is 175 x 250. The neighborhood considered is smaller than in the
previous experiment. Different kernels have been trained to separate each panel
from the rest and, for lack of space, Table 3 only shows the best one.



Table 3. Results obtained dividing each panel into 153 regions (8 x 153 x 10 = 12240
different response vectors). In this table, only the best kernel result is shown.

Kernel    Analyzed feature  Panel 1  Panel 2  Panel 3  Panel 4  Panel 5  Panel 6  Panel 7  Panel 8  Error rate
          # SVs                2201     1926     2039     1882     2109     2321     2552     2118
RBF       Train error           854      781      832      698      864      917     1021      869  13.08 %
σ = 0.5   Test error            971      896      991      817     1005     1011     1221     1067

Table 3 shows that the final error rate and the number of support vectors
increase much more than before. The reason is that a more local neighborhood
treatment produces many regions that are homogeneous or similar to regions of
other panels. However, discriminant zones are concentrated in nearly the same
regions as in Figure 4.

4 Conclusions 

An automatic discriminant method has been developed in order to extract
discriminant regions from a given set of different objects. Objects have been
divided into various levels of regions considering different neighborhood
hierarchies, and Support Vector Machines have been used to extract the most
discriminant ones. Across the several experiments performed using different
neighborhood hierarchies, all of them show that the most discriminant
information is always localized in the central regions of a panel, which leads
us to consider the method a robust one. The results are satisfactory enough to
consider Support Vector Machines a reliable technique for such discrimination
problems. In our experiments, each object is divided into different regions
considering several neighborhood hierarchies. Our method sorts these regions
according to their distance to an optimal hyperplane calculated by the SVMs
considering different kinds of kernels. The final window size has to be selected
according to the degree of possible occlusions (a higher degree of occlusion
implies that regions with extensive neighborhoods cannot be used, given that
they will almost certainly be partially occluded).
Abstract. Recognition of occluded objects in synthetic aperture radar
(SAR) images is a significant problem for automatic object recognition.
Stochastic models provide some attractive features for pattern matching
and recognition under partial occlusion and noise. In this paper, we
present a hidden Markov modeling (HMM) based approach for recognizing
objects in synthetic aperture radar (SAR) images. We identify the
peculiar characteristics of a SAR sensor and using these characteristics
we develop feature based multiple stochastic models for a given SAR
image of an object. The models exploiting the relative geometry of
feature locations or the amplitude of scattering centers in SAR radar return
are based on sequentialization of scattering centers extracted from SAR
images. In order to improve performance under real world situations, we
integrate these models synergistically using their probabilistic estimates
for recognition of a particular object at a specific azimuth.
Experimental results are presented using real SAR images with varying amounts
of occlusion.

Keywords: hidden Markov modeling, object recognition, multiple
recognition models, SAR images



1 Introduction 

One of the critical problems for object recognition is that the recognition
approach should be able to handle partial occlusion of objects and spurious or
noisy data [1]. In most object recognition approaches, the spatial arrangement
of structural information of an object is the central part that offers the most
important information. We suggest an HMM-based object recognition mechanism that
effectively makes use of the available structural information as a whole rather
than viewing the spatial primitives individually. Its nondeterministic model
structure makes it capable of collecting useful information from distorted or
partially unreliable patterns. Many successful applications of HMMs in speech
recognition [2] and character recognition [3] attest to their usefulness. Thus,
the HMM is potentially an effective tool to recognize objects with partial
occlusion and noise. However, the limit of traditional HMMs is that they are
basically one-dimensional models. In this paper we use the features based on the
image formation process to encode the 2-D image into 1-D sequences. We use
information from both the relative positions of the scattering centers and their
relative magnitude in SAR images to address the fundamental issues of building
object models and using them for robust recognition of occluded objects.

[Figure 1: block diagram; the legible labels are "Training Images of ..." and
"Recognition Result".]

Fig. 1. The HMM-Based approach for recognition of occluded objects.

2 Technical Approach 

Figure 1 provides an overview of the HMM based approach for recognition of
occluded objects in SAR imagery. During an off-line phase, scattering centers
are extracted from SAR images by finding local maxima of intensity. Both
locations and magnitudes of these peak features are used in the approach. These
features are viewed as emitting patterns of some hidden stochastic process.
Multiple observation sequences based on both the relative geometry and amplitude
of the SAR return signal are used to build the bank of stochastic models. These
models provide robust recognition in the presence of severe occlusion and
unstable features caused by scintillation phenomena, where some of the features
may appear or disappear at random in an image. At the end of the off-line phase,
hidden Markov recognition models for various objects and azimuths are obtained.
Similar to the off-line phase, during the on-line phase features are extracted
from SAR images, and observation sequences based on these features are matched
by the HMM forward process with the stored models obtained previously. A maximum
likelihood decision is made on the classification results. The results obtained
from multiple models are then combined in a voting kind of approach that uses
both the object and azimuth label and its probability of classification. This
produces a rank ordered list of classifications of the test image and associated
confidences.



2.1 Related Work 

Preliminary work using HMM models for recognition of objects in infrared
imagery is described by Burger and Bhanu [4]. Kottke et al. [5] use the Radon
transform in conjunction with HMMs for representation and segmentation of
nonoccluded objects in SAR images. Fielding and Ruck [6] have used HMM models
for spatial-temporal pattern recognition to classify moving objects in image
sequences. Rao and Mersereau [7] have attempted to merge HMM and deformable
template approaches for image segmentation. Template matching [8] and major axis
[9] based approaches have been used to recognize and index objects in SAR
images; however, they are not suitable for recognizing occluded objects. Jones
and Bhanu [10] use a geometric hashing kind of approach for recognizing occluded
objects in simulated SAR images.



2.2 Hidden Markov Modeling Approach 

An HMM is defined as a triple $\lambda = (A, B, \pi)$, where $a_{ij}$ is the
probability that state $i$ transits to state $j$, $b_{ij}(k)$ is the probability
that we observe symbol $k$ in a transition from state $i$ to state $j$, and
$\pi_i$ is the probability of $i$ being the initial state.

Recognition Problem (Forward Procedure): The HMM provides us a useful mechanism
to solve the problems we face for robust object recognition. Given a model and a
sequence of observations, the probability that the observed sequence was
produced by the model can be computed by the forward procedure. Suppose we have
an HMM $\lambda = (A, B, \pi)$ and an observation sequence $y_1^T$. We define
$\alpha_i(t)$ as the probability that the Markov process is in state $i$, having
generated $y_1^t$, i.e.,

$$\alpha_i(t) = \sum_j \left[ \alpha_j(t-1)\, a_{ji}\, b_{ji}(y_t) \right], \quad t > 0,$$

where $\alpha_i(0) = 1$ or $0$, depending upon whether $i$ is an initial state
or not, respectively.

The probability that the HMM stopped at the final state and generated $y_1^T$
is $\alpha_{S_F}(T)$, where $S_F$ denotes the final state. After initialization
of $\alpha$, we compute it inductively. At each step the previously computed
$\alpha$ is used, until $T$ is reached. $\alpha_{S_F}(T)$ is the sum of
probabilities of all paths of length $T$.
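The forward recursion can be sketched in a few lines. The following is a minimal illustration for an arc-emission HMM with a single initial and a single final state; the nested-list layout `a[i][j]`, `b[i][j][k]` and the function names are assumptions chosen for readability, not the authors' code.

```python
def forward(a, b, initial_state, observations):
    """Return the list alpha_i(T) over states i for an arc-emission HMM.

    a[i][j]    : probability of a transition from state i to state j
    b[i][j][k] : probability of emitting symbol k on the transition i -> j
    """
    n = len(a)
    # alpha_i(0) is 1 for the initial state and 0 otherwise.
    alpha = [1.0 if i == initial_state else 0.0 for i in range(n)]
    for y in observations:
        # alpha_i(t) = sum_j alpha_j(t-1) * a_ji * b_ji(y_t)
        alpha = [sum(alpha[j] * a[j][i] * b[j][i][y] for j in range(n))
                 for i in range(n)]
    return alpha


def sequence_probability(a, b, initial_state, final_state, observations):
    """Probability that the model generated the sequence and stopped in final_state."""
    return forward(a, b, initial_state, observations)[final_state]
```

For long sequences the products underflow, so a practical version would work with scaled values or log probabilities; that refinement is omitted here.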



Training Problem (Baum-Welch Algorithm): Building an HMM is actually an
optimization of the model parameters so that the model describes the
observations better. This is a problem of training. The Baum-Welch re-estimation
algorithm is used to calculate the maximum likelihood model. Before we use the
Baum-Welch algorithm, we must introduce the counterpart of $\alpha_i(t)$,
$\beta_i(t)$, which is the probability that the Markov process is in state $i$
and will generate $y_{t+1}^T$, i.e.,

$$\beta_i(t) = \sum_j \left[ a_{ij}\, b_{ij}(y_{t+1})\, \beta_j(t+1) \right], \quad 0 \le t < T,$$

with $\beta_i(T) = 1$ or $0$, depending upon whether $i$ is a final state or
not, respectively. The probability of being in state $i$ at time $t$ and state
$j$ at time $t+1$ given the observation sequence $y_1^T$ and the model $\lambda$
is defined as follows:

$$\gamma_{ij}(t) = P(X_t = i, X_{t+1} = j \mid y_1^T) = \frac{\alpha_i(t-1)\, a_{ij}\, b_{ij}(y_t)\, \beta_j(t)}{\alpha_{S_F}(T)} \qquad (1)$$

Now the expected number of transitions from state $i$ to state $j$ given
$y_1^T$ at any time is simply $\sum_{t=1}^{T} \gamma_{ij}(t)$, and the expected
number of transitions from state $i$ to any state at any time is
$\sum_{t=1}^{T} \sum_k \gamma_{ik}(t)$. Then, given some initial parameters, we
can recompute $a_{ij}$, the probability of taking the transition from state $i$
to state $j$, as:

$$a_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_k \gamma_{ik}(t)} \qquad (2)$$

Similarly, $b_{ij}(k)$ can be re-estimated as the ratio between the frequency
that symbol $k$ is emitted and the frequency that any symbol is emitted:

$$b_{ij}(k) = \frac{\sum_{t:\, y_t = k} \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_{ij}(t)} \qquad (3)$$

It can be proved that the above equations are guaranteed to increase
$\alpha_{S_F}(T)$ until a critical point is reached, after which the re-estimate
will remain the same. In practice, we set a threshold as the ending condition
for re-estimation.

So the whole process of training an HMM is as follows: (1) Initially, we have
only an observation sequence $y_1^T$ and blindly set $(A, B, \pi)$. (2) Use
$y_1^T$ and $(A, B, \pi)$ to compute $\alpha$ and $\beta$. (3) Use $\alpha$ and
$\beta$ to compute $\gamma$. (4) Use $y_1^T$, $(A, B, \pi)$, $\alpha$, $\beta$
and $\gamma$ to compute $A$ and $B$. Go to step 2.
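One pass of steps (2) to (4) can be sketched as follows for a single observation sequence. This is a minimal illustration in plain Python, assuming an arc-emission HMM stored as nested lists and a set of final states; the function names and data layout are hypothetical, and parameters whose expected counts are zero are simply left unchanged.

```python
def forward_table(a, b, initial_state, obs):
    """alpha[t][i] for t = 0..T, with alpha[0] concentrated on the initial state."""
    n = len(a)
    alpha = [[1.0 if i == initial_state else 0.0 for i in range(n)]]
    for y in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * a[j][i] * b[j][i][y] for j in range(n))
                      for i in range(n)])
    return alpha


def backward_table(a, b, final_states, obs):
    """beta[t][i] for t = 0..T, with beta[T][i] = 1 for final states, else 0."""
    n, T = len(a), len(obs)
    beta = [None] * (T + 1)
    beta[T] = [1.0 if i in final_states else 0.0 for i in range(n)]
    for t in range(T - 1, -1, -1):
        y_next = obs[t]  # y_{t+1} in the 1-based notation of the text
        beta[t] = [sum(a[i][j] * b[i][j][y_next] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta


def reestimate(a, b, initial_state, final_states, obs):
    """One Baum-Welch update; returns (new_a, new_b, likelihood of obs under a, b)."""
    n, m, T = len(a), len(b[0][0]), len(obs)
    alpha = forward_table(a, b, initial_state, obs)
    beta = backward_table(a, b, final_states, obs)
    likelihood = sum(alpha[T][i] for i in final_states)  # assumed to be > 0
    # gamma[t][i][j] stores gamma_ij(t + 1) of the text (t is 0-based here).
    gamma = [[[alpha[t][i] * a[i][j] * b[i][j][obs[t]] * beta[t + 1][j] / likelihood
               for j in range(n)] for i in range(n)] for t in range(T)]
    new_a = [row[:] for row in a]
    new_b = [[cell[:] for cell in row] for row in b]
    for i in range(n):
        out_total = sum(gamma[t][i][k] for t in range(T) for k in range(n))
        for j in range(n):
            arc_total = sum(gamma[t][i][j] for t in range(T))
            if out_total > 0:
                new_a[i][j] = arc_total / out_total
            if arc_total > 0:
                new_b[i][j] = [sum(gamma[t][i][j] for t in range(T) if obs[t] == k)
                               / arc_total for k in range(m)]
    return new_a, new_b, likelihood
```

Calling `reestimate` repeatedly and stopping when the returned likelihood changes by less than a threshold reproduces the loop described above. Training on many sequences, as in the experiments, would accumulate the expected counts over all sequences before dividing.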
2.3 SAR Features

The features of SAR images we used are the scattering centers (local maxima), 
and the magnitude of these scattering centers. Unlike the visible images, SAR 
images are extremely sensitive to slight changes in viewpoint (azimuth and de- 
pression angle) and are not affected by scale. As a result, the magnitude and 
location of scattering centers are very sensitive to rotational transformation. It 
is observed that typically less than 1/3 of the scattering center locations remain 
unchanged on XPATCH simulated SAR data [10]. Similar results are obtained 
with real SAR data at one foot resolution. Therefore, to recognize occluded ob- 
jects, we need to get SAR images of an object at a given depression angle under 
various finely sampled azimuth angles. Ideally, we can have 360 SAR images 
of an object with one image corresponding to one and only one azimuth angle 
between 0° and 359°. Thus, we treat an object under different azimuth angles 
separately and build a model for each azimuth angle based on the SAR image 
taken under this azimuth angle. 

Note that when building a model for an object at a given azimuth, a single
model is inadequate because of noise, articulation, occlusion, etc. Therefore,
to increase robustness, we build multiple HMM models for a given object at a
specific azimuth, as discussed in the next section.

2.4 Extraction of Observation Sequences 

There are many ways to choose observation sequences, but we want to use
information from both the magnitude and the relative spatial location of the
scattering centers extracted from a SAR image. Also, the sequentialization
method should not be affected by distortion, noise, or partial occlusion and
should be able to represent the image efficiently. Based on the above
considerations, we employ two approaches to obtain the sequences. We assume that
the scattering centers have been sorted according to their magnitude in
descending order. Note that sequences $O_1$ and $O_2$ are of length $n$, whereas
$O_3$, $O_4$, and $O_5$ are of length $(n - 1)$.

• Sequences based on amplitudes: $O_1 = \{Mag_1, Mag_2, \ldots, Mag_n\}$,
where $Mag_i$ is the amplitude of the $i$th scattering center.

• Sequences based on relative geometrical relationships:

$O_2 = \{d(1,2), d(2,3), \ldots, d(n,1)\}$

$O_3 = \{d(1,2), d(1,3), \ldots, d(1,n)\}$

$O_4 = \{d(2,1), d(2,3), \ldots, d(2,n)\}$

$O_5 = \{d(3,1), d(3,2), \ldots, d(3,n)\}$,

where $d(i,j)$ is the Euclidean distance between scattering centers $i$ and $j$.

Figure 2 gives an example to illustrate how we get the sequences. Sequence
$O_1$ is obtained by sorting the scattering centers by their magnitude (no
location information). It captures the characteristics of the magnitudes of the
scattering centers. Sequences $O_2$ through $O_5$ are obtained based on the
relative locations of the scattering centers, and the magnitude of the
scattering centers is not used. These four sequences capture the geometric
structure of the scattering centers and measure the relative distances between
the scattering centers. The actual distances between scattering centers are
determined by the resolution of the SAR image and they are independent of the
distance between the sensor and the target. Thus, sequences $O_2$ through $O_5$
are not dependent on scale.

Fig. 2. Example of an observation sequence superimposed on an image of a tank.

In experiments described in Section 3, we only consider the top 30 scattering 
centers (sorted in descending order of their magnitude). This is because we 
expect that the scattering centers with larger magnitude are relatively more 
stable than the weaker ones. 
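The construction of the five sequences can be sketched as follows. This is a minimal illustration that assumes each scattering center is given as an `(x, y, magnitude)` tuple and that at least three centers are available; the function name is hypothetical. Quantizing the resulting real values into the discrete symbols required by the HMM is a separate step that is not shown.

```python
import math


def observation_sequences(scatterers, top_n=30):
    """Build O1..O5 from (x, y, magnitude) tuples; needs at least 3 scatterers."""
    # Keep the strongest scatterers, sorted by magnitude in descending order.
    pts = sorted(scatterers, key=lambda s: s[2], reverse=True)[:top_n]
    n = len(pts)

    def d(i, j):
        """Euclidean distance between scatterers i and j (0-based indices)."""
        return math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1])

    o1 = [p[2] for p in pts]                      # magnitudes, length n
    o2 = [d(i, (i + 1) % n) for i in range(n)]    # successive distances, closes on d(n,1)
    o3 = [d(0, j) for j in range(n) if j != 0]    # distances from scatterer 1
    o4 = [d(1, j) for j in range(n) if j != 1]    # distances from scatterer 2
    o5 = [d(2, j) for j in range(n) if j != 2]    # distances from scatterer 3
    return o1, o2, o3, o4, o5
```

With `top_n=30`, as in the experiments, $O_1$ and $O_2$ contain 30 values and $O_3$ to $O_5$ contain 29.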

2.5 Integration of Results from Multiple Sequences 

Since not all models based on various sequences for a particular object and azi- 
muth will provide optimal recognition performance under occlusion, noise, etc., 
we improve the recognition performance by combining the results obtained from 
multiple kinds of models. 

We have developed a voting-like method to integrate the results from models 
based on a given number of sequences. The algorithmic steps are: 

(1) For each test image, we collect the N highest possibilities in the test results 
corresponding to each of the sequences. Each possibility is the probability 
that the test image is the image of that object at that azimuth. 

(2) A normalization is done to the N probabilistic estimates corresponding to 
each of the sequences so that estimates from different sequences can be com- 
pared. 

(3) We draw a histogram with probability vs. object and azimuth for the results 
obtained in step (2). We pick up the object with the highest frequency in 
the histogram. 

(4) If the object associated with the highest frequency in the histogram is the 
same as the groundtruth, we count it as one correct recognition. 
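A minimal sketch of steps (1) to (3) is given below. The text does not specify the normalization or how the histogram is accumulated, so this sketch makes two assumptions: each sequence type's top-N likelihoods are normalized to sum to one, and the normalized mass is accumulated per object over all azimuths. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def integrate(results_per_sequence, top_n=5):
    """Combine per-sequence results into one object label.

    results_per_sequence: one dict per sequence type, mapping
    (object_label, azimuth) -> likelihood of the test sequence under that model.
    """
    histogram = defaultdict(float)
    for scores in results_per_sequence:
        # Step (1): keep the N most likely (object, azimuth) hypotheses.
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        total = sum(p for _, p in best)
        if total <= 0:
            continue
        # Step (2): normalize so that different sequence types are comparable.
        for (obj, _azimuth), p in best:
            # Step (3): accumulate the normalized mass per object.
            histogram[obj] += p / total
    if not histogram:
        return None
    return max(histogram, key=histogram.get)
```

Step (4) then compares the returned label with the ground truth of the test image.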
2.6 Recognition Procedure

The recognition procedure is described as follows: 

(1) Loop (2), (3) for all the testing observation sequences. 

(2) Loop (3) for all the models in the model base. 

(3) Feed the observation sequence into the model $(A, B, \pi)$. Use the forward
algorithm to compute the probability that this sequence is produced by this
model.

(4) The model with maximum probability of an observation sequence is selected 
as the best match. 

(5) For each testing sequence, perform integration on the results of individual 
sequences. 

(6) For each testing sequence, obtain final recognition result by selecting the 
object class with the highest frequency of occurrence. 

3 Experiments 

• Data: We use MSTAR public real SAR images (at one foot resolution and
depression angle 15°) of 2 objects (T72 tank with serial number #a64 and ZSU
tank with serial number #d08). Ideally, we would have 360 object models, one for
each azimuth, for each object. However, we do not have 360 SAR images for each
object in the MSTAR data set. For the T72 tank, there are 288 images available
for different azimuths. For the ZSU tank, 288 images are also available. Thus,
each object consists of 288 azimuths (or aspects), which we call object models.
Each object model consists of HMM models based on observation sequences ($O_1$
to $O_5$). We extract the 30 scattering centers with the largest magnitudes from
each SAR image. Figure 3 shows some examples of SAR imagery, regions of interest
and scattering centers superimposed on the SAR image. ROIs are obtained
automatically using a dilation/erosion process.

We consider the occlusion to occur possibly from 9 different directions (cen- 
ter, 4 sides and 4 corners of the image). Scattering centers being occluded are 
not available. Moreover, we add back into the image at random locations a num- 
ber of spurious scattering centers, equal to the number of occluded scatterers, 
of random magnitude. For example, for 30% occlusion, we remove 9 scattering 
centers from the center of one object or from one particular direction and add 
randomly 9 spurious scattering centers back into the image. We compute the ob- 
servation sequences based on the scattering centers available after the occlusion 
process has taken place. 
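The occlusion process can be sketched for one direction as follows. The text does not state how the occluded scatterers are selected for a given direction, so this sketch assumes that occlusion from the right removes the scatterers with the largest x coordinate; the function name, the uniform sampling of spurious scatterers, and the parameters `width`, `height` and `max_mag` are likewise assumptions.

```python
import random


def occlude_from_right(scatterers, fraction, width, height, max_mag, rng=None):
    """Remove a fraction of (x, y, magnitude) scatterers and add spurious ones.

    The k scatterers with the largest x are dropped (assumed meaning of
    occlusion from the right), then k spurious scatterers with random
    location and magnitude are appended.
    """
    rng = rng or random.Random(0)
    k = round(fraction * len(scatterers))
    kept = sorted(scatterers, key=lambda s: s[0])[:len(scatterers) - k]
    spurious = [(rng.uniform(0, width), rng.uniform(0, height),
                 rng.uniform(0, max_mag)) for _ in range(k)]
    return kept + spurious
```

With 30 scatterers and `fraction=0.3`, nine scatterers are removed and nine spurious ones are added, matching the 30% example in the text. The observation sequences are then recomputed from the returned list.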

Training Data: We generate 91 training sequences of each type ($O_i$) from
each SAR image. The first one is obtained from the original SAR image without
occlusion. Then we occlude the SAR image from 9 directions. For each direction,
the occlusion level is 5% and 10%. For each occlusion level, we extract 5
training observation sequences. So 91 sequences are generated from each image.
We have two objects and 288 SAR images of each object, thus the number of
training sequences of each type ($O_i$) is 52,416. Since there are 5 kinds of
observation sequences, the total number of sequences is 262,080.

[Figure 3, panels: (a) SAR image of a T72 tank; (b) features extracted for T72
tank; (c) SAR image of a ZSU tank; (d) features extracted for ZSU tank.]

Fig. 3. Real SAR images and regions of interest (ROIs) (with peaks shown as black
dots superimposed on the ROI) for T72 tank #a64 and ZSU tank #d08.

Testing Data: From each SAR image, we generate 37 testing sequences of
each type ($O_i$). The first one is obtained from the original SAR image without
occlusion. Then we occlude the SAR image from 9 directions. For each direction,
the occlusion level ranges from 20% to 50% in 10% increments. For each occlusion
level, we extract only 1 testing sequence. Thus, we have a total of 21,312
testing sequences of each type. In our experiment, we require that the occlusion
level be 20%, 30%, 40% and 50%. That is, the numbers of occluded scattering
centers are 6, 9, 12 and 15 respectively. Also, when testing, we only use the
sequences obtained when the occlusion was from direction 1, which is the
direction from the right side of the image. So, for each occlusion level, we
have 576 testing sequences. Since there are 5 kinds of observation sequences,
the total number of sequences for each occlusion level is 2880.

568 B. Bhanu and Y. Lin
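The sequence counts quoted for the training and testing data follow directly from the stated design. The short check below reproduces them; the variable names are chosen here for illustration only.

```python
directions = 9            # center, 4 sides and 4 corners
images = 2 * 288          # two objects, 288 azimuth images each
sequence_types = 5        # O1 to O5

# Training: the unoccluded image plus 9 directions x 2 levels x 5 sequences.
train_per_image = 1 + directions * 2 * 5
train_per_type = train_per_image * images
train_total = train_per_type * sequence_types

# Testing: the unoccluded image plus 9 directions x 4 levels x 1 sequence.
test_per_image = 1 + directions * 4 * 1
test_per_type = test_per_image * images

# Sequences actually used per occlusion level (direction 1 only).
used_per_level = images * sequence_types

print(train_per_image, train_per_type, train_total)
print(test_per_image, test_per_type, used_per_level)
```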

• Training: We performed experiments to choose the optimal number of states
and number of symbols of the HMM. We find that recognition performance increases
with the number of states and symbols. Considering both the recognition
performance and the computation cost, we choose 8 states and 32 symbols for our
HMM models. Using the algorithm presented above we built recognition models. For
one sequence type we have 576 (288 azimuths x 2 object classes) HMM models.
Since we have defined five kinds of observation sequences for each image
($O_1$, $O_2$, $O_3$, $O_4$, $O_5$), we get models based on each kind of
observation sequence.

• Testing Results: During the testing phase, for a given observation sequence
type ($O_1$, $O_2$, $O_3$, $O_4$, $O_5$), each of the 576 testing sequences is
tested against all models (576 models: 2 objects, each with 288 models, one for
each available azimuth angle). An object model represents the object at a
particular azimuth angle. Here we consider only the kind of object, which we
call ID: we count a recognition result as correct if the HMM model with the
maximum probability is associated with the object from which the testing
sequence was extracted. We do not consider the corresponding azimuth angles of
the HMM model and testing sequence.

The recognition results are shown in Tables 1 and 2. These results are
satisfactory at 30% - 40% occlusion. Table 1 shows the results when we combine
the results of all five sequences ($O_1$ to $O_5$). Table 2 shows the results
when only sequences $O_1$ and $O_2$ are used in integration. From these two
tables, we can see that the performance degrades as the occlusion level
increases. The results of Table 2, based on $O_1$ and $O_2$, are slightly better
than the results of Table 1, based on all five sequences ($O_1$ to $O_5$). There
are two reasons for this degradation in performance.

- First, the results from sequences $O_3$, $O_4$, and $O_5$ are less reliable
than the others. Each of these sequences measures distances from a specific
scattering center; for example, $O_3$ measures distances of other scattering
centers from (reference) scattering center 1, $O_4$ measures distances of
other scattering centers from scattering center 2, and so on. At lower rates of
occlusion, prominent scattering centers (with lower numbers) are less likely
to be occluded than at higher levels of occlusion. As a result, at higher
levels of occlusion, the performance of each of the sequences $O_3$, $O_4$,
and $O_5$ deteriorates (e.g., if the first scattering center is occluded, then
the entire sequence $O_3$ is subject to error), which is reflected in the
integration results. An alternative is to use all the scattering centers as
reference points to build models, not just the top three scattering centers
that we used in sequences $O_3$, $O_4$, and $O_5$, to achieve increased
performance at the expense of increased computation.

- Second, the relative distances in $O_3$ to $O_5$ are expected to be longer
than those in $O_2$. As a result of occlusion, the longer distances are more
strongly affected than the relatively shorter distances in $O_2$, which
measures the relative distances between successive scattering centers.

4 Conclusions 

We have presented a novel conceptual approach for the recognition of occluded
objects in real SAR images. The approach uses multiple HMM based models for
various observation sequences that are chosen based on the SAR image formation
and account for both the geometry and magnitude of SAR image features. We have
demonstrated that the HMM approach makes use of the available structural
information to solve the problem caused by occlusion and noise. It takes the
spatial arrangement of structural information as a whole and is able to collect
useful information from distorted or partially unreliable patterns. The results
generally meet the desired goals at 30% - 40% occlusion.



Table 1. Results of Integration (Sequences $O_1$ to $O_5$)

            # of Correct Recognition (by Occlusion Level)              # of test
            20%           30%           40%           50%             sequences
T72 Tank    280 (97.2%)   233 (80.9%)   200 (69.4%)   164 (56.9%)     288
ZSU Tank    283 (98.3%)   224 (77.8%)   190 (66.0%)   171 (59.4%)     288



Table 2. Results of Integration (Sequences $O_1$ and $O_2$)

            # of Correct Recognition (by Occlusion Level)              # of test
            20%           30%           40%           50%             sequences
T72 Tank    284 (98.6%)   249 (86.5%)   208 (72.2%)   186 (64.6%)     288
ZSU Tank    285 (99.0%)   238 (82.6%)   194 (67.4%)   175 (60.8%)     288
Abstract. The utility of an image or map depends on the quantity of
information we can extract from it by analyzing the spatial relationships of
the phenomenon represented. For this purpose, tools that describe aspects
such as spatial dependence or autocorrelation in patterns are used. The
statistical techniques that measure spatial dependence are very varied, but all
of them provide only scalar information about the variation of spatial
properties in the pattern, without analyzing the possible directedness of that
dependence. In this work, we take a vector approach to the analysis of spatial
dependence; therefore, given a pattern, besides quantifying its autocorrelation
level, we determine whether statistical evidence of directedness exists,
calculating the angle at which the direction appears. For this we use a
parametric method when the normality of the population can be assumed, and a
non-parametric method for uniform distribution.

Keywords: Spatial Dependence, Anisotropy, Directional Trend, Circular 
Statistics. 



1 Introduction and Previous Works 

The spatial distribution of a phenomenon can only be interpreted by evaluating
both the global effects (large scale trend or values at each point in space) and
the local scale effects due to the interaction of each point with its
neighboring points [1].

In the absence of these local effects, the values of the phenomenon vary
depending on the place; in other words, the values observed in a window change
systematically. Hence, there is no spatial dependence among values, and the
process is heterogeneous or non-stationary. On the contrary, if the existence of
local effects is detected, the process is spatially homogeneous or stationary.

Spatial dependence is a particular case of homogeneity. In images whose elements
show spatial correlation, the occurrence of a concrete value of the phenomenon
makes it more likely that this value occurs in nearby places. The existing
statistics for determining the existence of spatial dependence among elements of
an image are very varied [2][3], and they include non-spatial techniques such as
ANOVA, error terms in spatial autoregressive systems, Moran's I, Geary's c,
Getis' G*, variograms, and correlograms. All of them determine the existence or
absence of autocorrelation in an image, but they do not provide spatial
information about the direction in which it is manifested, that is, the
directional trend of the variation of parameters that define the thematic
characteristics of the image.

We propose in this paper a method, based on first and second-order bivariate 
circular statistics, that will allow us to study the spatial variability of the 
phenomenon, establish the existence of spatial dependence in an image and calculate 
the direction in which this appears, using a parametric procedure based on the 
standard and confidence ellipses for normal samples, and a non-parametric test for 
directionality for samples in which the condition of normality cannot be assumed. 



2 Difference Vectors Matrix 

Let $W$ be the matrix that keeps the values of a window of size
$(2k+1) \times (2k+1)$ obtained from an image. From $W$, the difference vectors
matrix, or matrix of first-order vectors, $V$ is obtained. For that, every
element $w_{i,j}$ is compared with the element that is diametrically opposed to
it, $w_{2k+2-i,\,2k+2-j}$; if $w_{i,j}$ is greater than $w_{2k+2-i,\,2k+2-j}$,
we assign the difference between these two values,
$(w_{i,j} - w_{2k+2-i,\,2k+2-j})$, to $v_{i,j}$ and a zero value to
$v_{2k+2-i,\,2k+2-j}$. If $w_{2k+2-i,\,2k+2-j}$ is greater than $w_{i,j}$, the
assignment is reversed.

[Matrix display of $W$: the legible entries are $w_{1,1}$, $w_{1,c}$,
$w_{c-d,c}$, $w_{c-d,c+d}$, $w_{c,1}$, $w_{c,c-d}$, $w_{c,c+d}$, $w_{c+d,c}$,
and $w_{c+d,c+d}$, where $c$ indexes the central row and column.]

The element $v_{i,j}$ of the matrix $V$ represents a vector with length
$r = v_{i,j}$ and angle $\Phi = \mathrm{ArcTan}(\cdot)$ (the argument is
illegible in the source), interpreted as a variation of the characteristic with
intensity $v_{i,j}$ in direction $\Phi$. From $V$, and taking values
$d = 1, 2, 3, \ldots, (n-1)/2$, we obtain $V(1), V(2), \ldots, V((n-1)/2)$,
submatrices of $V$ whose central element (the element that has $d$ rows both
above and below and $d$ columns both right and left) is the central element of
$V$. Each matrix $V(d)$, of size $(2d+1) \times (2d+1)$, is formed by the
elements $v_{i,j}$ whose distance to the central element is less than or equal
to $d$.

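Given $V(d)$, its mean variation vector $m_d$ (of length $r_d$ and mean angle
$\Phi_d$) is calculated as follows [4]:

[Equations (1) to (4) are only partly legible in the source. Eq. (1) defines a
quantity $N$. Eq. (2) forms sums over the $(2d+1) \times (2d+1)$ elements of the
weighted terms $b_{i,j}\, v_{i,j} \cos[\mathrm{ArcTan}(\cdot)]$ and
$b_{i,j}\, v_{i,j} \sin[\mathrm{ArcTan}(\cdot)]$. Eq. (3) takes the mean angle
as $\mathrm{ArcTan}(y/x)$ if $x > 0$ and $180 + \mathrm{ArcTan}(y/x)$ if
$x < 0$. Eq. (4) defines the weight $b_{i,j}(d)$ piecewise, with value 0 when
its stated condition does not hold.]

A Vector Approach to the Analysis of (Patterns with) Spatial Dependence 573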

If the mean variation vectors [d = 1..8] are calculated from the images in
Fig. 1 and Fig. 2, we obtain the values shown in Tables 1 and 2, respectively.



Table 1. Mean variation vectors calculated from Fig. 1

d           1     2     3     4     5     6     7     8
$r_d$    0.79  0.78  0.77  0.66  0.61  0.66  0.63  0.44
$\Phi_d$  292   290   281   270   253   237   225   222



Table 2. Mean variation vectors calculated from Fig. 2

d           1     2     3     4     5     6     7     8
$r_d$    0.35  0.23  0.06  0.10  0.03  0.07  0.06  0.26
$\Phi_d$  143    28    93   231   158   254    19   216




Fig. 1. Pattern with Moran's I = 0.34

Fig. 2. Pattern with Moran's I = 0.82

If we calculate the first-lag autocorrelation of Fig. 1 and Fig. 2 using
Moran's Index [5], examining the eight neighboring cells connected to each cell,
we obtain the values shown in Table 3. We can observe that Fig. 1 presents a
certain level of heterogeneity in the spatial distribution of the values. This
fact is confirmed by a low index of autocorrelation (0.34). Fig. 2, on the
contrary, shows an index close to one (0.82), which indicates a high spatial
dependence.
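For reference, first-lag Moran's I with eight-neighbor connectivity can be computed as in the sketch below. It assumes binary weights (1 for each of the up to eight adjacent cells, 0 otherwise) and no special edge correction; the paper does not state its weighting scheme, so these choices and the function name are assumptions.

```python
def morans_i(grid):
    """Moran's I of a rectangular grid using binary 8-neighbor weights."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    dev = [[v - mean for v in row] for row in grid]
    cross = 0.0        # sum over neighbor pairs of dev_i * dev_j
    weight_total = 0   # number of ordered neighbor pairs
    for i in range(rows):
        for j in range(cols):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        cross += dev[i][j] * dev[ni][nj]
                        weight_total += 1
    variance_sum = sum(d * d for row in dev for d in row)  # must be non-zero
    return (n / weight_total) * (cross / variance_sum)
```

Values near one indicate that neighboring cells tend to carry similar values, while values near zero indicate little first-lag dependence. The index is a scalar and carries no directional information, which is the gap the vector approach addresses.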







574 A. Molina 



Table 3. Autocorrelation parameters of Fig. 1 and 2

                                        Fig. 1   Fig. 2
Number of cells included                   289      289
Mean of cells included                   91.82    67.02
Standard Deviation (σ) of cell values     3.74     1.80
Moran's I                                 0.34     0.82



3 Bivariate and Second-Order Analysis

It is evident that, from observations made for one particular spatial lag, we cannot draw
conclusions about the behaviour of the image values at other lags, no matter how
many data are available or how sophisticated the analysis is. For this reason the
statistical analysis must be performed in two steps, called first- and second-order
analysis, or first and second stage of analysis [6]:

• For each k-lag [k = 1..d] we reduce the variation vectors by calculating the mean
variation vector $m_k$.

• We combine the mean variation vectors $m_1, \ldots, m_d$ of the previous step and test their
significance. Only then can we make statistical inferences about the directional
behaviour of the data.

The vector $m_k$ is described by an angle $\Phi_k$ and a modulus $r_k$; in other words, both
the angle and the amplitude of its modulus have to be considered. Under this
condition, the pairs $(r_1, \Phi_1), (r_2, \Phi_2), \ldots, (r_d, \Phi_d)$ become a bivariate, second-order
sample, and its treatment is considered second-order analysis.



3.1 Standard Ellipse 

Among the tools used for second-order analysis is the standard ellipse [7], which
serves exclusively descriptive purposes. The tips of the vectors of a second-order
sample form a scatter diagram with standard deviations in the x and y
directions and a certain upward or downward trend. The standard ellipse describes
this behaviour in condensed form: assuming normality, roughly 40% of the data
points fall inside the ellipse and 60% outside. The parent population need not be
normal, although it is desirable that it does not deviate too much from normality.
Drawing the standard ellipse requires two means, two standard deviations and a
correlation coefficient:

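$x_i = r_i \cos\Phi_i, \qquad y_i = r_i \sin\Phi_i$   (5)

$\bar{x} = \frac{1}{n}\sum_{i} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i} y_i$   (6)

$s_1^2 = \frac{1}{n-1}\sum_{i} (x_i - \bar{x})^2, \qquad s_2^2 = \frac{1}{n-1}\sum_{i} (y_i - \bar{y})^2$   (7)

$\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i} (x_i - \bar{x})(y_i - \bar{y}), \qquad r = \mathrm{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{s_1 s_2}$   (8)

The equation of the standard ellipse is

$A(x - \bar{x})^2 + 2B(x - \bar{x})(y - \bar{y}) + C(y - \bar{y})^2 = D$   (9)

where

$A = s_2^2, \qquad B = -r s_1 s_2, \qquad C = s_1^2, \qquad D = (1 - r^2)\, s_1^2 s_2^2$   (10)

The midpoint of the ellipse is at $(\bar{x}, \bar{y})$; the semi-axes $a$ and $b$ ($a \ge b$) are

$a = \left[\frac{2D}{A + C - R}\right]^{1/2}, \qquad b = \left[\frac{2D}{A + C + R}\right]^{1/2}$   (11)

where

$R = \left[(A - C)^2 + 4B^2\right]^{1/2}$   (12)

The sample shows the maximum variability in the direction of angle $\theta$. This is the
angle by which the major axis is inclined with respect to the X axis ($-90^\circ < \theta < 90^\circ$).

$\theta = \mathrm{ArcTan}\left[\frac{2B}{A - C - R}\right]$   (13)

The values of the parameters described, calculated from Table 1 and Table 2, are
shown in Tables 4 and 5, respectively. Figs. 3 and 4 show the corresponding
ellipses.

Table 4. Values of the standard ellipse calculated from vectors of Table 1

$\bar{x}$   $\bar{y}$   $s_1$    $s_2$    Cov      r        A       B       C       D       R       a       b       $\theta$
0.073       -0.368      0.189    0.370    -0.004   -0.062   0.137   0.004   0.036   0.005   0.101   0.370   0.189   272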



Table 5. Values of the standard ellipse calculated from vectors of Table 2

$\bar{x}$   $\bar{y}$   $s_1$    $s_2$    Cov      r        A       B       C       D        R       a       b       $\theta$
-0.076      -0.596      0.295    0.164    -0.042   -0.860   0.027   0.042   0.087   0.0006   0.103   0.329   0.075   332



As noted above, the standard ellipse serves exclusively descriptive purposes.
Nevertheless, it lets us judge whether the assumption of normality in the parent
population is reasonable. If the mean vector angles are uniformly spaced around the
coordinate origin, we can consider that the population from which the sample is
drawn does not differ from randomness or one-sidedness, which rules out normality.
When this occurs, the origin falls inside the ellipse. For this reason we can
consider, in an approximate way, that the parent population is normal if the
standard ellipse does not contain the coordinate origin.
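
The computation of the standard ellipse can be sketched as follows. This is an illustrative implementation under stated assumptions, not the author's code: it assumes the sample statistics use the n - 1 divisor and the coefficient definitions A = s2^2, B = -r s1 s2, C = s1^2, D = (1 - r^2) s1^2 s2^2, with semi-axes sqrt(2D / (A + C -/+ R)) and inclination ArcTan[2B / (A - C - R)]. The function name and the dictionary keys are hypothetical. The sketch does not guard against the degenerate case A - C - R = 0 (zero correlation with s2 > s1).

```python
import math

def standard_ellipse(r, phi_deg):
    """Standard-ellipse parameters from moduli r_i and angles Phi_i (in degrees)."""
    n = len(r)
    x = [ri * math.cos(math.radians(p)) for ri, p in zip(r, phi_deg)]
    y = [ri * math.sin(math.radians(p)) for ri, p in zip(r, phi_deg)]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s1 = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s2 = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / (n - 1)
    corr = cov / (s1 * s2)
    A, B, C = s2 ** 2, -corr * s1 * s2, s1 ** 2
    D = (1 - corr ** 2) * s1 ** 2 * s2 ** 2
    R = math.sqrt((A - C) ** 2 + 4 * B ** 2)
    return {
        "mean_x": mean_x, "mean_y": mean_y, "s1": s1, "s2": s2, "corr": corr,
        "a": math.sqrt(2 * D / (A + C - R)),   # major semi-axis
        "b": math.sqrt(2 * D / (A + C + R)),   # minor semi-axis
        "theta": math.degrees(math.atan(2 * B / (A - C - R))),
    }

# Moduli and angles as listed in Table 1 (d = 1..8).
r_table1 = [0.79, 0.78, 0.77, 0.66, 0.61, 0.66, 0.63, 0.44]
phi_table1 = [292, 290, 281, 270, 253, 237, 225, 222]
params = standard_ellipse(r_table1, phi_table1)
```

Because the tabulated moduli and angles are rounded, values recomputed this way can differ slightly in the last digits from those printed in the tables.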
Fig. 3. Standard ellipse depicted from parameters contained in Table 4

Fig. 4. Standard ellipse depicted from parameters contained in Table 5



3.2 Confidence Ellipse 

With the standard ellipse we have described the spatial behaviour of the mean variation
vectors, quantifying their directional trend through $\theta$. Nevertheless, we
must determine whether this directional trend is caused by random fluctuations of the
vectors or, on the contrary, by the existence of directedness or
anisotropic variability [8]. The confidence ellipse is used to test for directedness.

The confidence ellipse is a region in the xy-plane that covers the unknown
population centre with a preassigned probability $Q = 1 - \alpha$; it is therefore a tool for
statistical inference. Assuming normality, this region has the shape of an ellipse
which has the same centre as the standard ellipse and the same inclination of the
major axis ($\theta$). The existence of directedness is determined by
generating the confidence ellipse and testing whether the origin lies in its interior.
If it does not, the population centre cannot coincide with the origin and $(\bar{x}, \bar{y})$ is
significantly different from it; we conclude that the mean vectors as a group are
oriented in the direction $\psi$:

$\psi = \mathrm{ArcTan}\left[\frac{\bar{y}}{\bar{x}}\right]$   (14)

For the confidence ellipse, the coefficients A, B and C in Eq. (9) have the same values
as for the standard ellipse in Eqs. (10). The coefficient D becomes

$D = (1 - r^2)\, s_1^2 s_2^2\, n^{-1}\, T^2$   (15)

where

$T^2 = \frac{2(n-1)}{n-2}\, F_{2,n-2}(\alpha)$   (16)

Here, $F_{2,n-2}(\alpha)$ denotes the critical F value with 2 and n-2 degrees of freedom and
significance level $\alpha$. The confidence ellipse has the same centre and the same $\theta$ value as
the standard ellipse, since both are independent of D, as Eqs. (6) and (13) reveal. In both
ellipses the principal axes coincide; only the semi-axes vary. Let $a_1$, $b_1$ be the
semi-axes with the special parameter $D = 1$. Then we obtain from Eqs. (11), for arbitrary
values $D \neq 0$,

$a = a_1 D^{1/2}, \qquad b = b_1 D^{1/2}$   (17)

Since a and b are proportional to $D^{1/2}$, we obtain

$a = a_s\, n^{-1/2}\, T, \qquad b = b_s\, n^{-1/2}\, T$   (18)

where $a_s$, $b_s$ are the semi-axes of the standard ellipse. From Eq. (9) it follows that
points inside the region bounded by the confidence ellipse fulfil the
inequality

$A(x - \bar{x})^2 + 2B(x - \bar{x})(y - \bar{y}) + C(y - \bar{y})^2 < D$   (19)

If the origin falls within the ellipse, the population centre could coincide with the
origin and the sample is not directed. In this case, inequality (19) is fulfilled with the
special values x = 0 and y = 0. From Eqs. (15) and (19), the condition for the existence of
directedness at significance level $\alpha$ is

$T^2 > T^2(\alpha)$   (20)

where $T^2(\alpha)$ is the critical value given by Eq. (16) and the test statistic is

$T^2 = \frac{n}{1 - r^2}\left[\frac{\bar{x}^2}{s_1^2} - \frac{2 r \bar{x} \bar{y}}{s_1 s_2} + \frac{\bar{y}^2}{s_2^2}\right]$   (21)
Analysing the standard ellipse (Fig. 4) obtained from Fig. 2, we observe that the
assumption of normality is reasonable, so a confidence ellipse is applicable to it.
We want to determine the confidence ellipse at a significance level $\alpha$ = 5%, i.e. a
confidence coefficient Q = 95%, from Fig. 2. As a first step we calculate $T^2$ using
Eq. (16). From a table of the F-distribution we read $F_{2,6}(0.05)$ = 5.14. Hence $T^2$ = 11.99.
From Eqs. (15) and (18) we obtain the parameters D, a and b shown in Table 6. Our
confidence ellipse (dashed curve in Fig. 5) is somewhat bigger than the
standard ellipse. Since the confidence ellipse does not contain the origin in its interior,
$(\bar{x}, \bar{y})$ differs significantly from the origin. Therefore, the mean vectors are oriented as
a group, in the direction $\psi = -97.25^\circ$.

Table 6. Parameters of confidence ellipse calculated from vectors of Table 2

         Fig. 1            Fig. 1             Fig. 2
         $\alpha$ = 0.1    $\alpha$ = 0.05    $\alpha$ = 0.05
$T^2$    8.07              11.99              11.99
a        0.37              0.45               0.40
b        0.19              0.23               0.09
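
The steps of the directedness test can be sketched numerically. This is a hedged illustration, not the author's implementation: it assumes the critical value has the form 2(n-1)/(n-2) F, that the confidence semi-axes are the standard ones scaled by sqrt(T^2/n), and that the test statistic is the Hotelling-type expression n/(1-r^2) [x^2/s1^2 - 2 r x y/(s1 s2) + y^2/s2^2] evaluated at the sample centre. The function names are hypothetical, and the F value 5.14 is the one quoted in the text rather than computed.

```python
import math

def t2_critical(n, f_crit):
    """Critical value T^2(alpha) from the tabulated F_{2,n-2}(alpha)."""
    return 2.0 * (n - 1) / (n - 2) * f_crit

def t2_statistic(n, mean_x, mean_y, s1, s2, corr):
    """Statistic comparing the sample centre (mean_x, mean_y) with the origin."""
    q = ((mean_x / s1) ** 2
         - 2.0 * corr * mean_x * mean_y / (s1 * s2)
         + (mean_y / s2) ** 2)
    return n / (1.0 - corr ** 2) * q

def confidence_semi_axes(a_std, b_std, n, t2):
    """Scale the standard-ellipse semi-axes to the confidence ellipse."""
    scale = math.sqrt(t2 / n)
    return a_std * scale, b_std * scale

# n = 8 mean variation vectors; F_{2,6}(0.05) = 5.14 as read from an F table.
t2_crit = t2_critical(8, 5.14)
# Rounded standard-ellipse values as printed in Table 4.
t2_obs = t2_statistic(8, 0.073, -0.368, 0.189, 0.370, -0.062)
directed = t2_obs > t2_crit
# Rounded semi-axes as printed in Table 5, scaled to the 95% confidence ellipse.
a_conf, b_conf = confidence_semi_axes(0.329, 0.075, 8, t2_crit)
```

With the rounded inputs of Table 4 the statistic stays below the 5% critical value, which matches the statement that directedness cannot be assumed with Q = 95% for that sample.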



In Fig. 3 we observe that, although the standard ellipse does not contain the origin,
the origin is very close to it. Consequently, the assumption of normality must be made
with certain reservations. Fig. 6 shows the confidence ellipses with errors
$\alpha$ = 0.05 and $\alpha$ = 0.1 drawn from the parameters of Tables 5 and 6. The inner
ellipse does not contain the coordinate origin, while the outer one does; we can
therefore assume the existence of directedness ($\psi$ = -11.16°) with Q = 90%, but not with Q = 95%.



3.3 Not Normal Bivariate Population 

The procedure described in the previous section requires that the second-order scatter
diagram be a sample drawn from a normal bivariate population. Nevertheless, there
are occasions in which, judging from the circular histogram of the sample, this
normality cannot be assumed. When this happens, Moore's test
[9] makes it possible to determine whether the sample is directed or, on the contrary,
presents a uniform distribution. This test is non-parametric because only the ranks of
the $r_i$ are used.

In this procedure, we rank the $r_i$ from smallest to largest, letting $t_i$ denote the
rank of the i-th mean variation vector. The null hypothesis for this test states that the
$\Phi_i$ are independently and uniformly distributed on the circle. Let

$C = \sum_{i} t_i \cos\Phi_i, \qquad S = \sum_{i} t_i \sin\Phi_i, \qquad D = \sqrt{C^2 + S^2}, \qquad D^* = D / n^{3/2}$   (22)

We reject the null hypothesis in favour of the hypothesis of directionality if

$D^* > D^*(\alpha)$   (23)

As noted in the previous section, the proximity of the standard ellipse
to the origin in Fig. 3 makes us doubt whether the sample can be considered as
drawn from a normal bivariate population. Table 7 shows the critical values of
Moore's test and of the confidence ellipse for different errors $\alpha$; the corresponding
test statistics are $D^*$ = 1.249 and $T^2$ = 8.749. With these values, according to Moore's
test, directedness exists with an error $\alpha$ > 0.010. If, on the contrary, we use the
confidence ellipse, we find directedness with $\alpha$ > 0.100. The difference between
both tests arises because the confidence ellipse presupposes normality,
a strong constraint that we cannot clearly establish by inspecting
the standard ellipse.
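
The rank-based statistic of Moore's test can be sketched as below. The sketch assumes the normalization D* = D / n^(3/2), which is the usual form of the test but is not fully legible in the source; ties among the moduli are broken by input order, which is a simplification (tied values would normally receive averaged ranks). The function name is hypothetical.

```python
import math

def moore_statistic(r, phi_deg):
    """Moore's D* = sqrt(C^2 + S^2) / n^(3/2), where t_i is the rank of r_i."""
    n = len(r)
    order = sorted(range(n), key=lambda i: r[i])   # indices from smallest to largest r
    rank = [0] * n
    for position, index in enumerate(order, start=1):
        rank[index] = position
    c = sum(t * math.cos(math.radians(p)) for t, p in zip(rank, phi_deg))
    s = sum(t * math.sin(math.radians(p)) for t, p in zip(rank, phi_deg))
    return math.sqrt(c * c + s * s) / n ** 1.5

# Eight vectors pointing the same way give the largest possible value for n = 8.
aligned = moore_statistic([1, 2, 3, 4, 5, 6, 7, 8], [0] * 8)
# Two opposite vectors largely cancel each other.
opposed = moore_statistic([1, 2], [0, 180])
```

The resulting value is then compared with the tabulated critical value for the chosen error, as in Eq. (23).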



Table 7. Critical values of $D^*$ and $T^2$

$\alpha$          0.001     0.005     0.010     0.025     0.050     0.100
$D^*(\alpha)$     1.397     1.300     1.242     1.148     1.059     0.949
$T^2(\alpha)$     63.000    33.927    25.480    16.940    11.990    8.074
Fig. 5. Confidence (dashed curve) and standard ellipses depicted from values of Tables 4 and 5



4 Conclusion 

Previous works on the evaluation of spatial dependence in patterns, based on
statistics of spatial autocorrelation, provide exclusively quantitative knowledge
about the topological structure and spatial relationships in the distribution of a spatial
phenomenon, but they analyze neither the existence of spatial anisotropy nor
the direction in which it manifests. In this paper we have described a method for:

• Visualizing the spatial variability of the pattern. 

• Determining the normality of parent population. 

• Testing the existence of spatial trend or directedness in patterns that show spatial
dependence.

• Calculating the direction $\psi$ in which that anisotropy is revealed.

Fig. 6. Confidence ellipses with errors $\alpha$ = 0.05 and $\alpha$ = 0.1



For that, we have used a parametric method (confidence and standard ellipses) when
the normality of the population can be assumed, and a non-parametric method (Moore's
test) for uniform distribution. Both are based on first- and second-order bivariate
circular statistics.
Abstract. When dealing with normally distributed classes, it is well
known that the optimal discriminant function for two classes is linear
when the covariance matrices are equal. In this paper, we determine
conditions for the optimal linear classifier when the covariance matrices
are non-equal. In all the cases discussed here, the classifier is given by a
pair of straight lines, which is a particular case of the general equation of
second degree. One of these cases is when we have two overlapping classes
with equal means, which is a general case of Minsky's Paradox
for the Perceptron. Our results, which to our knowledge are the pioneering
results for pairwise linear classifiers, yield a general linear classifier
for this particular case, which can be obtained directly from the parameters
of the distribution. Numerous other analytic results for two- and
d-dimensional normal vectors have been derived. Finally, we have also
provided some empirical results in all the cases, and demonstrated that
these linear classifiers achieve very good performance.



1 Introduction 

The aim of statistical Pattern Recognition (PR) is to find a discriminant function
which can be used to classify an object, represented by its features, which
belongs to a certain class. In most cases, this function is linear or quadratic.
When the classes are normally distributed, it is not always possible to find the
optimal linear classifier. In all the known results in this field, a linear
function that achieves Bayes classification for normally distributed class-conditional
distributions has only been reported when the covariance matrices are equal.

As opposed to optimal linear classifiers, many attempts have been made to
yield linear classifiers, using Fisher's approach, the perceptron algorithm
(the basis of the back propagation Neural Network learning algorithms),
Piecewise Recognition Models, Random Search Optimization, and
Removal Classification Structures. All of these approaches suffer from the
lack of optimality, and thus although they find linear discriminant functions, the
classifier is not optimal.

In this paper, we show that there are other cases for normal distributions 
and non-equal covariance matrices in which the discriminant function is linear 
and the classifier is optimal. One of these cases is when we have two overlapping
classes with equal means, and mutually orthogonal covariance matrices. But 
as opposed to all the previously studied linear classifiers, the new techniques 
introduced here yield pairwise linear classifiers, which emerge as degenerate cases 
of the general quadratic classifier. 

Minsky showed that it is not possible to find a single linear classifier for the
simple case in which the features of one class are the Exclusive-OR of a 2-bit
binary vector and the features of the second class are the negated features. This
paradox, also called Minsky's Paradox [7], demonstrated that a single perceptron
could not correctly classify in this simple scenario. As opposed to this,
we show that it is possible to find two optimal linear discriminant functions,
given as a pair of straight lines, which is a particular case of the quadratic
discriminant function. These classifiers have some advantages over the traditional
linear discriminant approaches, such as Fisher's, perceptron learning, and other
ones, because the classifier that we obtain is both linear and optimal. Finally,
we conclude this introductory section by observing that, to the best of our
knowledge, the results of this paper are pioneering. We are unaware of any work
that has been done in statistical PR which investigates the design and use of
optimal pairwise linear classifiers.



2 Pattern Classification 



2.1 Bayes Decision Theory 



The main goal of PR is to find the class that an object belongs to given its
features. In statistical models, the features are represented as random vectors
in the domain of the real numbers. In particular, the probability density
function for a d-dimensional random vector X which is normally distributed is

$p(X) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (X - M)^T \Sigma^{-1} (X - M) \right\}$ ,   (1)

where M is the mean vector and $\Sigma$ is the covariance matrix.

Suppose we have two classes, $\omega_1$ and $\omega_2$, with a priori probabilities $P(\omega_1)$ and
$P(\omega_2)$. We can write the general inequality specifying the Bayesian classification
between two classes as follows:

$p(X|\omega_1) P(\omega_1) \; \underset{\omega_1}{\overset{\omega_2}{\lessgtr}} \; p(X|\omega_2) P(\omega_2)$ .   (2)

Equality in (2) represents the discriminant function. We assume that $\omega_1$ and
$\omega_2$ are represented by normal random vectors with covariance matrices $\Sigma_1$, $\Sigma_2$,
and mean vectors $M_1$, $M_2$, respectively.
Also, without loss of generality, we assume that $\omega_1$ and $\omega_2$ have the same a
priori probability, 0.5. Taking the logarithm of both sides of (2), the discriminant
function is:

$\log\frac{|\Sigma_2|}{|\Sigma_1|} - (X - M_1)^T \Sigma_1^{-1} (X - M_1) + (X - M_2)^T \Sigma_2^{-1} (X - M_2) = 0$ .   (3)

Consider the two cases in (3). When $\Sigma_1 = \Sigma_2$ the discriminant function is
linear. When $\Sigma_1$ and $\Sigma_2$ are arbitrary, the classifier results in a general
equation of second degree in the form of a hyperparaboloid, hyperellipsoid,
hypersphere, or a pair of hyperplanes or hyperboloids. In our discussion,
we are interested in the case when the classifier is a pair of hyperplanes. The
classifier is a pair of straight lines for d = 2.
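
The discriminant of Eq. (3) can be sketched for the two-dimensional, diagonal-covariance setting used in the rest of the paper. This is an illustration only: the function names are hypothetical, equal priors are assumed as in the text, and each covariance matrix is represented by its two diagonal entries (variances). A positive value of the left-hand side means that the class-1 density is the larger one.

```python
import math

def discriminant(x, m1, v1, m2, v2):
    """Left-hand side of Eq. (3) for 2-D classes with diagonal covariances.

    v1 and v2 hold the diagonal entries (variances) of Sigma_1 and Sigma_2.
    """
    log_term = math.log((v2[0] * v2[1]) / (v1[0] * v1[1]))
    q1 = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, m1, v1))
    q2 = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, m2, v2))
    return log_term - q1 + q2

def classify(x, m1, v1, m2, v2):
    """Return 1 when the point is assigned to class 1 (positive discriminant), else 2."""
    return 1 if discriminant(x, m1, v1, m2, v2) > 0 else 2

# With equal covariances the quadratic terms cancel and the function is linear:
# here it reduces to 4x + 4y.
value = discriminant((0.5, 0.25), (1.0, 1.0), (1.0, 1.0), (-1.0, -1.0), (1.0, 1.0))
```

The equal-covariance call illustrates the first case mentioned above: the quadratic terms cancel and only a linear function of the features remains.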

In the interest of brevity, the proofs of the theorems stated here are omitted.
They are found in […]. Also, the bibliography is abbreviated. A more complete
bibliography and comparative survey is found in […], which is the unabridged
version of this paper.

As per our survey, the results on pairwise linear classifiers that we present 
here are the first formal reported results on using such classifiers in any avenue 
of statistical PR. 



3 Linear Discriminants in Diagonalization 

3.1 The 2-Dimensional Case 



Diagonalization is the process of transforming a space by performing linear and
whitening transformations. A brief summary of this strategy can be found
in […] and is omitted here in the interest of brevity. Given any arbitrary normal
random vectors, $X_1$ and $X_2$, whose covariance matrices are $\Sigma_1$ and $\Sigma_2$, we can
perform simultaneous diagonalization to obtain normal random vectors, $V_1$ and
$V_2$, whose covariance matrices are diagonal, namely $I$ and $\Lambda$, respectively. In
what follows, we assume that the dimension of our problem is d = 2.

To be more specific, we assume that after simultaneous diagonalization,
the mean vectors and covariance matrices have the form:

$M_1 = \begin{bmatrix} p \\ q \end{bmatrix}, \quad M_2 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}$ .   (4)

Since we will be, for the present, consistently dealing with two-dimensional
vectors, we shall assume that the feature vector has the form $X = [x \;\; y]^T$.

For our discussion, we let $X \sim N(M, \Sigma)$ denote a normal random vector, X,
with covariance matrix $\Sigma$ and mean vector M. We will now present a linear
transformation that will later prove useful in simplifying complex expressions.
The transformation is stated more formally in Theorem 1.
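Theorem 1. Let $X_1 \sim N(M_{1x}, \Sigma_1)$ and $X_2 \sim N(M_{2x}, \Sigma_2)$ be two normal
random vectors, such that

$M_{1x} = \begin{bmatrix} p \\ q \end{bmatrix}, \quad M_{2x} = \begin{bmatrix} r \\ s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$ .   (5)

Vectors $X_1$ and $X_2$ can be transformed into $Z_1 \sim N(M_{1z}, \Sigma_1)$ and
$Z_2 \sim N(M_{2z}, \Sigma_2)$, where $M_{1z} = [t \;\; u]^T$ and $M_{2z} = [-t \;\; -u]^T$.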

Theorem 2. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two random
vectors, such that:

$M_1 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad M_2 = \begin{bmatrix} -r \\ -s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}$ .   (6)

If a and b are positive real numbers, such that:

$a(1 - b)\, r^2 - \frac{1}{4}(ab - a - b + 1) \log ab = (a - 1)\, b\, s^2$ ,   (7)

then the optimal classifier obtained by Bayes classification is a pair of straight
lines. □

Equation (7) is the necessary and sufficient condition that real numbers
a > 0, b > 0, r, and s must satisfy in order to yield the optimal linear classifier
between two classes represented by normal random vectors with parameters of
the form given in (6).

Consider the following: given positive real numbers a > 0 and b > 0, we
would like to find real numbers, r and s, that satisfy (7). The cases for which it
is possible to find real numbers, r and s, satisfying (7) are stated below.

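Theorem 3. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normal random
vectors, such that

$M_1 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad M_2 = \begin{bmatrix} -r \\ -s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$ .   (8)

For any positive real numbers a and b, there exist real numbers r and s only
if a > 1 and 0 < b < 1, or 0 < a < 1 and b > 1. □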
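
The condition of Theorem 2 is easy to check numerically. The sketch below evaluates the difference between the two sides of Eq. (7) and, for given a, b and r, solves it for s. It assumes the parametrization in which the second covariance matrix is diag(1/a, 1/b), so a and b are the reciprocals of its diagonal entries; the function names are hypothetical, and the numbers in the example are the DD estimates reported later in the simulation section. The case a = 1 is not handled.

```python
import math

def condition_residual(a, b, r, s):
    """Left side minus right side of Eq. (7); zero means a pair of straight lines."""
    lhs = a * (1 - b) * r ** 2 - 0.25 * (a * b - a - b + 1) * math.log(a * b)
    rhs = (a - 1) * b * s ** 2
    return lhs - rhs

def solve_s(a, b, r):
    """Non-negative s satisfying Eq. (7) for given a, b, r, or None if no real s exists."""
    numerator = a * (1 - b) * r ** 2 - 0.25 * (a * b - a - b + 1) * math.log(a * b)
    s_squared = numerator / ((a - 1) * b)
    return math.sqrt(s_squared) if s_squared >= 0 else None

# Reciprocals of the diagonal entries .4599 and 2.8232 reported for the DD test.
a, b = 1 / 0.4599, 1 / 2.8232
residual = condition_residual(a, b, 1.0342, 1.8686)
s_from_r = solve_s(a, b, 1.0342)
# Both a and b above 1: no real s exists, in line with Theorem 3.
no_solution = solve_s(2.0, 2.0, 1.0)
```

For the DD estimates the residual is close to zero and the recovered s is close to the reported second mean component, which is consistent with those parameters having been chosen to satisfy Eq. (7).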



3.2 The d-Dimensional Case 

In Sect. 3.1 we have presented the necessary and sufficient conditions under which
the optimal classifier is a pair of straight lines, for the two-dimensional space. We
have also shown that for some covariance matrices, it is not possible to find the
pairwise linear optimal classifier. We are now interested in finding the cases for
the d-dimensional space in which the optimal classifier is a pair of hyperplanes.

Let us consider the more general case for d > 2. We are required to determine
whether it is possible to find a pair of hyperplanes as the optimal Bayes classifier.
This problem can be solved using the result of Theorem 4 below.
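Theorem 4. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normal random
vectors, such that

$M_1 = \begin{bmatrix} m_{11} \\ \vdots \\ m_{1d} \end{bmatrix}, \quad M_2 = \begin{bmatrix} m_{21} \\ \vdots \\ m_{2d} \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix}, \text{ and } \Sigma_2 = \mathrm{diag}(\ldots)$ .

It is not possible to find a pair of hyperplanes as the optimal Bayes classifier.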



4 Special Cases of Linear Discriminant 

In this section we analyze two special cases of diagonal covariance matrices that 
lead to the optimal linear discriminant function. The necessary and sufficient 
conditions to achieve a linear classifier are discussed in both cases. The second, 
and more specific case, is that where the mean vector is the same for the two 
classes under consideration. 



4.1 Linear Discriminant with Different Means 



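Consider two normal random vectors of dimension d = 2. Using the diagonalization
process discussed in Sect. 3.1, any covariance matrices and mean vectors can be
converted into the following:

$M_1 = \begin{bmatrix} p \\ q \end{bmatrix}, \quad M_2 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$ .   (9)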



Starting with normal random vectors and these parameters, we are interested 
in analyzing linear classifiers for a more particular case. 



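Theorem 5. Let $X_1$ and $X_2$ be two normal random vectors with covariance
matrices and mean vectors as in (9). It is possible to transform $X_1$, $X_2$ into
$Z_1 = A^T X_1$, $Z_2 = A^T X_2$, respectively, where $A = \begin{bmatrix} a^{-1/4} & 0 \\ 0 & b^{-1/4} \end{bmatrix}$, and the new
covariance matrices and mean vectors have the form:

$M_{1z} = \begin{bmatrix} p' \\ q' \end{bmatrix}, \quad M_{2z} = \begin{bmatrix} r' \\ s' \end{bmatrix}, \quad \Sigma_{1z} = \begin{bmatrix} a' & 0 \\ 0 & b' \end{bmatrix}, \text{ and } \Sigma_{2z} = \begin{bmatrix} b' & 0 \\ 0 & a' \end{bmatrix}$ ,   (10)

only if $b = a^{-1}$.

We now state the conditions necessary to obtain a pair of straight lines when
we have the form of (10).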
Theorem 6. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normal random
vectors such that

$M_1 = \begin{bmatrix} p \\ q \end{bmatrix}, \quad M_2 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} b^{-1} & 0 \\ 0 & a^{-1} \end{bmatrix}$ .   (11)

The optimal Bayes classifier is a pair of straight lines when $(p - r)^2 = (q - s)^2$,
for a, b any positive real numbers. Moreover, if $M_1 = [r \;\; s]^T$ and $M_2 = [-r \;\; -s]^T$,
the condition simplifies to $r^2 = s^2$. □

Theorem 6 states the necessary and sufficient condition for a pairwise linear
classifier between two normal random vectors, with means and covariances of
the form given in (11).
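
Whether a quadratic discriminant degenerates into a pair of straight lines can be checked with the standard 3 x 3 determinant test for conics. The sketch below builds the conic coefficients of the discriminant for two 2-D classes with diagonal covariances and equal priors, and evaluates that determinant. It is an illustration under these assumptions; the function name is hypothetical and the example values are the DM estimates reported in the simulation section. A zero determinant is necessary for degeneracy; the quadratic part must also be of hyperbolic type for the lines to be real and intersecting.

```python
import math

def conic_determinant(m1, v1, m2, v2):
    """Determinant of the conic A x^2 + C y^2 + 2 D x + 2 E y + F = 0 obtained from
    the discriminant of two 2-D normal classes with diagonal covariances.

    v1, v2 are the diagonal entries (variances); there is no x*y term in this setting.
    """
    A = 1 / v2[0] - 1 / v1[0]
    C = 1 / v2[1] - 1 / v1[1]
    D = m1[0] / v1[0] - m2[0] / v2[0]
    E = m1[1] / v1[1] - m2[1] / v2[1]
    F = (math.log((v2[0] * v2[1]) / (v1[0] * v1[1]))
         - m1[0] ** 2 / v1[0] - m1[1] ** 2 / v1[1]
         + m2[0] ** 2 / v2[0] + m2[1] ** 2 / v2[1])
    # Determinant of [[A, 0, D], [0, C, E], [D, E, F]].
    return A * C * F - A * E ** 2 - C * D ** 2

# DM estimates: mirrored variances and means with (p - r)^2 = (q - s)^2.
det_dm = conic_determinant((-0.9555, 0.9555), (1.808, 0.3188),
                           (0.9555, -0.9555), (0.3188, 1.808))
# Moving one mean component breaks the condition of Theorem 6.
det_shifted = conic_determinant((0.0, 0.9555), (1.808, 0.3188),
                                (0.9555, -0.9555), (0.3188, 1.808))
```

The determinant vanishes for the DM parameters and becomes clearly non-zero once the means no longer satisfy the condition of Theorem 6.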

We now consider the more general case for d > 2. We are interested in
finding the conditions that guarantee a pairwise linear discriminant function.
This is given in Theorem 7 below.



Theorem 7. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normal random
vectors such that

$M_1 = [m_{11} \ldots m_{1i} \ldots m_{1j} \ldots m_{1d}]^T, \quad M_2 = [m_{21} \ldots m_{2i} \ldots m_{2j} \ldots m_{2d}]^T$ ,

$\Sigma_1 = \mathrm{diag}(\ldots)$, and $\Sigma_2 = \mathrm{diag}(\ldots)$ are $d \times d$ diagonal matrices.

The optimal classifier, obtained by Bayes classification, is a pair of hyperplanes
when $(m_{1i} - m_{2i})^2 = (m_{1j} - m_{2j})^2$, […] for k = 1, ..., d, a positive real
number. □



4.2 Linear Discriminant with Equal Means 

In this section, we discuss a particular instance of the problem discussed in
Sect. 4.1. Let us consider the generalization of Minsky's paradox, that is, when
$M_1 = M_2$. We shall now show that it is always possible to find a pair of straight
lines when $M_1 = M_2$, $\Sigma_1 = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$, and $\Sigma_2 = \begin{bmatrix} b & 0 \\ 0 & a \end{bmatrix}$, thus re-solving the paradox
in the most general case.
Theorem 8. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two random vectors
such that:

$M_1 = M_2 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} b^{-1} & 0 \\ 0 & a^{-1} \end{bmatrix}$ .   (12)

The optimal classifier obtained by Bayes classification is a pair of straight
lines for positive real numbers a and b, where r and s are any real numbers. □

The power of this will be obvious when the classification results are discussed
in a subsequent section.



5 Classification 

5.1 The Discriminant Function 

In this section we discuss classification with the linear discriminant functions
determined in Sect. 4, for dimension d = 2.

The discriminant functions for the cases discussed in Sects. 3.1, 4.1 and 4.2
are quadratic equations that represent pairs of straight lines. For the purpose of
classification, we need to find one equation for each straight line. This is done by
inspection or by solving the quadratic equation in terms of y. The second-degree
polynomial equations have the following roots:

$y_+ = A_1 x + B_1$, and $y_- = A_2 x + B_2$ .   (13)

Let us now consider the third case, discussed in Sect. 4.2. The equation for
each straight line can be found as per the following theorem.

Theorem 9. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two random vectors
such that

$M_1 = M_2 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}, \text{ and } \Sigma_2 = \begin{bmatrix} b^{-1} & 0 \\ 0 & a^{-1} \end{bmatrix}$ .   (14)

The equations of the linear discriminant functions, i.e. the optimal classifiers,
are

$y_+ = -x + (r + s)$ and $y_- = x + (s - r)$ .   (15)

□

Special cases of the discriminant functions for the distributions discussed in
Sects. 3.1 and 4.1 are found in […].
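
The two lines of Theorem 9 can be verified directly: with equal means and mirrored diagonal covariances the determinants of the two covariance matrices coincide, so the logarithmic term of the discriminant vanishes and only the difference of the two quadratic forms remains. The sketch below evaluates that difference on points taken from both lines. The function names are hypothetical, and the numerical values are the equal-means (EM) estimates reported in the simulation section, used here as if they were exact parameters.

```python
def equal_means_discriminant(x, y, r, s, a, b):
    """Discriminant for M1 = M2 = [r, s]^T, Sigma_1 = diag(1/a, 1/b), Sigma_2 = diag(1/b, 1/a).

    The log term is zero because both covariance matrices have the same determinant.
    """
    dx, dy = x - r, y - s
    q1 = a * dx ** 2 + b * dy ** 2   # quadratic form of class 1
    q2 = b * dx ** 2 + a * dy ** 2   # quadratic form of class 2
    return q2 - q1

def line_plus(x, r, s):
    """y_+ = -x + (r + s)."""
    return -x + (r + s)

def line_minus(x, r, s):
    """y_- = x + (s - r)."""
    return x + (s - r)

r, s = 5.9841, 6.1766
a, b = 1 / 0.0904, 1 / 12.752
on_lines = [
    equal_means_discriminant(x, line(x, r, s), r, s, a, b)
    for x in (-3.0, 0.0, 2.5)
    for line in (line_plus, line_minus)
]
off_line = equal_means_discriminant(r + 1.0, s, r, s, a, b)
```

All six values computed on the lines are zero up to rounding, while a point off both lines gives a non-zero discriminant.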
6 Simulation Results

In this section we present some examples illustrating the different cases discussed
in previous sections. In all of the examples we have chosen the dimension
d = 2 and two classes, $\omega_1$ and $\omega_2$. We also discuss the empirical results obtained
after testing the linear classifier with 100 points for each class generated
randomly, using the maximum likelihood approach in estimating the parameters,
assuming that they are of the form found in the respective cases.

The two classes, $\omega_1$ and $\omega_2$, are represented by two normal random vectors
$X_1 \sim N(M_1, \Sigma_1)$, $X_2 \sim N(M_2, \Sigma_2)$. We used one instance for each of the three
cases and generated a set of 100 normal random points, in order to test the
accuracy of the classifiers.

In the first test (referred to as DD, whose plot is given in [12]) we considered
the pairwise linear discriminant function in diagonalization. We used the
following covariance matrices and mean vectors (estimated from 100 training
samples) to yield the respective classifier:

$M_1 = \begin{bmatrix} 1.0342 \\ 1.8686 \end{bmatrix}, \quad M_2 = \begin{bmatrix} -1.0342 \\ -1.8686 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} .4599 & 0 \\ 0 & 2.8232 \end{bmatrix}$ .

The accuracy of the classifier was 98% for $\omega_1$ and 99% for $\omega_2$. The power of
the scheme is obvious!

In the second test (referred to as DM, whose plot is also found in [12]) we
considered the pairwise linear discriminant with different means. The following
estimated covariance matrices and mean vectors were obtained by using 100
training samples.

$M_1 = \begin{bmatrix} -.9555 \\ .9555 \end{bmatrix}, \quad M_2 = \begin{bmatrix} .9555 \\ -.9555 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1.808 & 0 \\ 0 & .3188 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} .3188 & 0 \\ 0 & 1.808 \end{bmatrix}$ .

Using the above parameters, the pairwise linear classifier was derived. The
plot of the points and the linear discriminant function are shown in [12]. The
accuracy of the classifier was 91% for $\omega_1$ and 95% for $\omega_2$.

To show the power of the scheme, we also tested our results for the case of
the pairwise linear classifier with equal means (EM) for the generalized Minsky's
Paradox. By using 100 training samples generated with equal means but mirrored
covariances, we obtained the following estimated covariance matrices and mean
vectors:

$M_1 = M_2 = \begin{bmatrix} 5.9841 \\ 6.1766 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} .0904 & 0 \\ 0 & 12.752 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 12.752 & 0 \\ 0 & .0904 \end{bmatrix}$ .

The plot of the points and the linear discriminant function from these estimates
is given in Fig. 1. The accuracy of the classifier was 96% for $\omega_1$ and 90%
for $\omega_2$. The accuracy is very high in this case, despite the fact that the classes
overlap and the discriminant functions are pairwise linear. The power of the
results presented is obvious.
Fig. 1. Example of pairwise linear discriminant with equal means for the case described
in Theorem 8. This represents the resolution of the general case of Minsky's Paradox.
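
A rough Monte Carlo check of the equal-means experiment can be sketched as follows. This is not the authors' experimental protocol: it treats the reported EM estimates as if they were the true parameters, draws many more points than the 100 used in the paper, and uses a fixed seed. The function name, the sample size and the seed are assumptions made for the illustration. With these mirrored covariances the discriminant reduces to comparing the absolute deviations from the common mean along the two axes.

```python
import random

def simulate_equal_means(n_per_class=5000, seed=0):
    """Estimate per-class accuracy of the pairwise linear classifier (EM case)."""
    rng = random.Random(seed)
    mean = (5.9841, 6.1766)
    var1 = (0.0904, 12.752)   # class 1: narrow along x, wide along y
    var2 = (12.752, 0.0904)   # class 2: mirrored

    def sample(var):
        return tuple(rng.gauss(m, v ** 0.5) for m, v in zip(mean, var))

    def assigned_class(point):
        dx, dy = point[0] - mean[0], point[1] - mean[1]
        # Between the lines y = -x + (r + s) and y = x + (s - r): |dx| < |dy| -> class 1.
        return 1 if abs(dx) < abs(dy) else 2

    hits1 = sum(assigned_class(sample(var1)) == 1 for _ in range(n_per_class))
    hits2 = sum(assigned_class(sample(var2)) == 2 for _ in range(n_per_class))
    return hits1 / n_per_class, hits2 / n_per_class

accuracy_1, accuracy_2 = simulate_equal_means()
```

Under these assumptions both estimated accuracies fall in the same range as the figures quoted above, although the exact values depend on the seed and on the sample size.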



7 Conclusions 

In this paper we have studied the problem of determining pairwise linear classifiers
for the case of normally distributed classes. We have shown that, contrary to
what is known, it is possible to find the optimal linear discriminant function
even when the covariance matrices are different. In all the cases discussed
here, the functions obtained are pairs of straight lines, which are a particular
case of the general second-degree equation.

By a formal procedure, we have determined the conditions for these particular
discriminant functions in three cases. The first case occurs after diagonalization.
We have explicitly derived the necessary and sufficient conditions for the
covariance matrices and the mean vectors so as to yield a pair of straight lines for
the optimal classifier. We have also shown that it is impossible to find a pair of
hyperplanes as the optimal Bayes classifier in the d-dimensional case.

The second case is when we have particular forms in the two diagonal
covariance matrices.

In the third case, assuming equal means, we have found that it is always
possible to obtain a pair of straight lines when we have covariance matrices with
the same form as found in the second case, thus re-solving Minsky's paradox!

The results derived in the paper have also been experimentally verified. The
empirical results obtained show that the accuracy of the classifier is very high.
This is understandable since the classifier is optimal. The degree of this accuracy
is even more striking when we recognize that we are dealing with a linear
discriminant function for classes which are significantly overlapping.
Abstract. This paper deals with the optimum classifier and the performance
evaluation by the Bayesian approach. Gaussian population with
unknown parameters is assumed. The conditional density given a limited
sample of the population has a relationship to the multivariate
t-distribution. The mean error rate of the optimum classifier is theoretically
evaluated by the quadrature of the conditional density. To verify the
optimality of the classifier and the correctness of the mean error calculation,
the results of Monte Carlo simulation employing a new sampling
procedure are shown. It is also shown that the Bayesian formulas of the
mean error rate have the following characteristics. 1) The unknown population
parameters are not required in its calculation. 2) The expression
is simple and clearly shows the limited sample effect on the mean error
rate. 3) The relationship between the prior parameters and the mean
error rate is explicitly expressed.



1 Introduction 

The Bayesian approach treats unknown parameters as random variables and
assumes their a priori distributions. The essential role of the a priori
distribution has not been well understood, and the validity of the Bayesian
approach and its application has long been debated [1]. The essential
attraction of the approach is that it enables us to design the optimum
classifier from a limited sample and to evaluate the mean error rate using
known parameters alone.

This paper deals with the optimum classifier and its performance evaluation
by the Bayesian approach. A Gaussian population with unknown parameters
is assumed. The conditional density given a limited sample of the population
is related to the multivariate t-distribution. As a result, the obtained
optimum classifier differs from the quadratic classifier, which is known to be
optimum for Gaussian distributions with known parameters. In particular, when
the sample sizes of the classes are not equal, the optimum discriminant function
is not quadratic and the decision surface is not a hyperquadric.

The mean error rate of the optimum classifier is theoretically evaluated by
the quadrature of the conditional density. For the univariate case, the mean
error rate of the two-class problem with different sample sizes and different
sample covariance matrices is evaluated (not presented in this paper because of
the page limit). For the multivariate case, the problem with common sample size,
common sample covariance matrices, and common a priori probabilities is
evaluated. Since these mean error rates are obtained by taking the expectation
of the error rate over the unknown population parameters, which are treated as
random variables, they depend only on known quantities such as the sample
parameters, the sample size, and the dimensionality. In this respect, the
Bayesian mean error rate has an interpretation and significance different from
those of the non-Bayesian mean error rate, which requires the unknown
population parameters in its calculation. To verify the optimality of the
classifier and the correctness of the mean error calculation, the results of a
Monte Carlo simulation employing a new sampling procedure are shown.

The optimum classifier based on the Bayesian approach was first derived by
Keehn [2]. He studied the asymptotic properties of the optimum classifier and
calculated the type I error, which is the rejection rate for a given threshold
value of the likelihood. However, the mean error rate for the two-class problem
was not evaluated, and the properties of the optimum classifier other than the
asymptotic ones were not studied.

In the subsequent sections, the case with an unknown covariance matrix (and a
known mean vector) is described in Sections 2 to 4. A new sampling procedure and
the results of the Monte Carlo simulation are described in Section 5.

2 Sample Conditional Density of Gaussian Population 

The sample conditional density of a d-dimensional feature vector $X$ of a
Gaussian population with unknown covariance matrix, given a sample
$\chi = \{X_1, X_2, \ldots, X_n\}$, is expressed by

$$p(X|\chi) = \int_S p(X|K)\, p(K|\chi)\, dK, \tag{1}$$

where $K$ is the inverse of the population covariance matrix and $S$ is the
$d(d+1)/2$ dimensional subspace on which $K$ is positive definite. The density
$p(X|K)$ is the d-variate Gaussian distribution, and the density $p(K|\chi)$ is
the Wishart distribution with $n_n$ degrees of freedom.
Performing the integration (1), we have

$$p(X|\chi) = \frac{\Gamma\left(\frac{n_n+1}{2}\right)}{(n_n\pi)^{d/2}\,\Gamma\left(\frac{n_n-d+1}{2}\right)}\, |\Sigma_n|^{-1/2} \left[1 + \frac{1}{n_n}(X-M)^t\Sigma_n^{-1}(X-M)\right]^{-\frac{n_n+1}{2}},$$

$$\Sigma_n = \frac{n_0\Sigma_0 + n\Sigma}{n_0+n}, \qquad \Sigma = \frac{1}{n}\sum_{i=1}^{n}(X_i-M)(X_i-M)^t, \qquad n_n = n_0+n, \tag{2}$$

where $M$ is the population mean vector, and $\Sigma_0$ and $n_0$ are an initial
estimate of the population covariance matrix and the confidence constant,
respectively. When $n_0$ is set to zero, $n_n$ and $\Sigma_n$ coincide with $n$
and $\Sigma$, respectively, and no knowledge about the prior distribution is
utilized. The notation $\Gamma(x)$ denotes the gamma function.

By the variable transformation

$$X - M = \sqrt{\frac{n_n}{n_n-d+1}}\; T, \tag{3}$$

$T$ follows the multivariate elliptical t-distribution with $n_n - d + 1$ degrees
of freedom.



3 Optimum Discriminant Function 

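The optimum discriminant function for the general case is derived from (2) as

$$g(X) = -2\log\{p(X|\chi)P(\omega)\}
       = (n_n+1)\log\left[1+\frac{1}{n_n}(X-M)^t\Sigma_n^{-1}(X-M)\right]
         + \log|\Sigma_n| - 2\log D - 2\log P(\omega), \tag{4}$$

where $D = \Gamma\left(\frac{n_n+1}{2}\right) / \left\{(n_n\pi)^{d/2}\,\Gamma\left(\frac{n_n-d+1}{2}\right)\right\}$
is the normalizing constant of (2) and $P(\omega)$ is the a priori probability
of the class.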
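As a rough illustration of how the discriminant function (4) might be evaluated, the sketch below computes $g(X) = (n_n+1)\log[1 + (X-M)^t\Sigma_n^{-1}(X-M)/n_n] + \log|\Sigma_n| - 2\log D - 2\log P(\omega)$ for one class. It assumes that $D$ is the normalizing constant of the t-type density (2); the function names `cholesky` and `discriminant` are hypothetical and not taken from the paper. A pattern would be assigned to the class with the smallest value of $g$.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^t = a, for a symmetric positive definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def discriminant(x, mean, sigma_n, n_n, prior):
    """g(X) of Eq. (4); smaller values mean the class is more plausible.

    x, mean : feature vector and known class mean (lists of length d)
    sigma_n : d x d matrix Sigma_n (list of rows)
    n_n     : n_0 + n; prior : a priori probability P(omega)
    """
    d = len(x)
    l = cholesky(sigma_n)
    # Forward substitution solves L z = x - mean, so z.z is the quadratic form.
    z = []
    for i in range(d):
        s = x[i] - mean[i] - sum(l[i][k] * z[k] for k in range(i))
        z.append(s / l[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(l[i][i]) for i in range(d))
    # log D, with D assumed to be the normalizing constant of the density (2).
    log_d = (math.lgamma((n_n + 1) / 2) - math.lgamma((n_n - d + 1) / 2)
             - 0.5 * d * math.log(n_n * math.pi))
    return ((n_n + 1) * math.log1p(quad / n_n) + log_det
            - 2.0 * log_d - 2.0 * math.log(prior))
```

When the sample sizes differ between classes, the factor $(n_n+1)$ and the constant $D$ differ as well, which is why the resulting decision surface is no longer quadratic.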



4 Evaluation of Mean Error Rate 

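The sample size, the covariance matrices and the a priori probabilities are
assumed to be common to the two classes. The logarithm of the likelihood ratio
is given by

$$h(X) = \sqrt{\frac{n_n-d+1}{n_n}}\,(M_2-M_1)^t\Sigma_n^{-1}X
       + \frac{1}{2}\sqrt{\frac{n_n-d+1}{n_n}}\,\left(M_1^t\Sigma_n^{-1}M_1 - M_2^t\Sigma_n^{-1}M_2\right). \tag{5}$$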



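The distribution of $\sqrt{(n_n-d+1)/n_n}\,X$ is the d-variate elliptical
t-distribution with $n_n-d+1$ degrees of freedom, and the distribution of $h(X)$
is the univariate t-distribution with the same degrees of freedom. The means of
$h(X)$ are given by

$$\eta_1 = -\frac{1}{2}\sqrt{\frac{n_n-d+1}{n_n}}\,\delta_n^2, \qquad
  \eta_2 = \frac{1}{2}\sqrt{\frac{n_n-d+1}{n_n}}\,\delta_n^2, \qquad
  \delta_n^2 = (M_2-M_1)^t\Sigma_n^{-1}(M_2-M_1). \tag{6}$$

The variance of $h(X)$ is given by

$$\sigma_i^2 = \frac{n_n-d+1}{n_n}\,(M_2-M_1)^t\Sigma_n^{-1}\,E\{(X-M_i)(X-M_i)^t\,|\,\omega_i\}\,\Sigma_n^{-1}(M_2-M_1)
            = \frac{n_n-d+1}{n_n-d-1}\,\delta_n^2. \tag{7}$$

Using these parameters the mean error rate is given by

$$\varepsilon = P(\omega_1)\varepsilon_1 + P(\omega_2)\varepsilon_2
            = \Psi_{n_n-d+1}\left(-\frac{1}{2}\left(1-\frac{d-1}{n_n}\right)^{1/2}\delta_n\right). \tag{8}$$

When $n_0 = 0$,

$$\varepsilon = \Psi_{n-d+1}\left(-\frac{1}{2}\left(1-\frac{d-1}{n}\right)^{1/2}\delta\right), \tag{9}$$

where $\delta^2 = (M_2-M_1)^t\Sigma^{-1}(M_2-M_1)$. The function $\Psi_n(x_0)$ is
defined by

$$\Psi_n(x_0) = \int_{-\infty}^{x_0} t_n(x)\,dx, \tag{10}$$

where $t_n$ is the univariate t-distribution with $n$ degrees of freedom.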

The Bayesian formulas of the mean error rate (8) and (9) have the following
characteristics when compared with the non-Bayesian formulas.

1. The unknown population parameters are not required in their calculation.

2. The expression is simple and clearly shows the limited sample effect on the
mean error rate.

3. The relationship between the prior parameters $n_0$, $\Sigma_0$ and the mean
error rate is explicitly expressed.

It should be noted that the Mahalanobis distance $\delta$ in (9) is an apparent
one, calculated using the known population mean vectors and the sample
covariance matrix. Equation (9) reveals two causes that increase the mean error
rate under the limited sample effect. One is that the tail area of the
t-distribution increases as the degrees of freedom decrease. The other is that
the apparent squared Mahalanobis distance between the two classes shrinks by
the fraction $(d-1)/n$, which increases the mean error rate (Fig. 1). The effect
of the former is marginal and is negligible if $n - d + 1$ is greater than 20 or
so, because the t-distribution with this many degrees of freedom can be
approximated by the Gaussian distribution, which is the t-distribution with
infinite degrees of freedom. On the other hand, the effect of the latter is
severe and is not negligible unless the sample size is much larger than the
dimensionality. Such shrinkage of the apparent Mahalanobis distance has its
origin in the variable transformation (3), and causes the problem known as the
"peaking phenomenon" or "curse of dimensionality" [3]. This undesirable
phenomenon is caused and aggravated by neglecting the prior distribution by
setting $n_0 = 0$. The case $n_0 \neq 0$ is discussed in Section 6.



Optimum Classifier and Performance Evaluation by Bayesian Approach 595




Fig. 1. Increase of mean error rate due to limited sample effect 



5 Computer Simulation 

5.1 Bayesian Sampling 

In the following computer simulation, a new sampling procedure called Bayesian
sampling is employed together with the ordinary sampling procedure. Fig. 2
illustrates the relationship between the ordinary sampling (a) and the Bayesian
sampling (b). In the ordinary sampling, samples of a specified size are drawn
from a specified population and the sample parameters are calculated. Fig. 2 (a)
illustrates the case with a Gaussian population $N(0, I)$ and three samples of
size five with the sample covariance matrices $\Sigma_a$, $\Sigma_b$, and
$\Sigma_c$. The classifiers are designed using these sample parameters and the
mean error rate for the population is evaluated. Since the sample parameters
are random variables, the expectation of the error rate is taken by repeating
the sampling for both the design and the test of the classifier. In contrast,
the Bayesian sampling generates populations from which a sample with specified
parameters, e.g. $N(0, I)$, is extracted. When a sample of the specified size is
drawn from a temporary population $N(0, I)$ and its sample covariance matrix is
$\Sigma_a$, the actual population is determined to be $N(0, \Sigma_a^{-1})$.
Since the population parameters are random variables in this case, the
expectation of the error rate is taken by repeating the Bayesian sampling for
the test of the classifier. The design of the classifier need not be repeated
because the design sample is fixed throughout the experiment. In this example,
the sample mean vector and the sample covariance matrix are assumed to be the
zero vector and the identity matrix, respectively. The general procedure is
described below.

Fig. 2. Relationship between ordinary sampling (a), and Bayesian sampling (b)



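The population parameters are determined so that the parameters of a sample
drawn from the population are $(\mu_2, \Sigma_2)$. The parameters of a sample of
size $n$ drawn from a temporary population $N(0, I)$ are denoted by
$(\mu_1, \Sigma_1)$, i.e.

$$\mu_1 = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
  \Sigma_1 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\mu_1)(X_i-\mu_1)^t. \tag{11}$$

By setting

$$Y = \Phi_1\Lambda_1^{-1/2}\Phi_1^t(X-\mu_1) \qquad (\Sigma_1\Phi_1 = \Phi_1\Lambda_1), \tag{12}$$

the sample parameters are transformed to $(0, I)$, i.e.

$$\frac{1}{n}\sum_{i=1}^{n} Y_i = 0, \qquad
  \frac{1}{n-1}\sum_{i=1}^{n} Y_iY_i^t
  = \Phi_1\Lambda_1^{-1/2}\Phi_1^t\,\Sigma_1\,\Phi_1\Lambda_1^{-1/2}\Phi_1^t
  = \Phi_1\Phi_1^t = I, \tag{13}$$

and the population parameters of $Y$ are given by

$$E(Y) = -\Phi_1\Lambda_1^{-1/2}\Phi_1^t\mu_1, \qquad
  V(Y) = E[\{Y-E(Y)\}\{Y-E(Y)\}^t] = \Sigma_1^{-1}, \tag{14}$$

where $\Lambda_1$ and $\Phi_1$ are the eigenvalue matrix and the eigenvector
matrix of $\Sigma_1$, respectively.

Further, by setting

$$Z = \Phi_2\Lambda_2^{1/2}Y + \mu_2
    = \Phi_2\Lambda_2^{1/2}\Phi_1\Lambda_1^{-1/2}\Phi_1^t X + \mu_2
      - \Phi_2\Lambda_2^{1/2}\Phi_1\Lambda_1^{-1/2}\Phi_1^t\mu_1
    \qquad (\Sigma_2\Phi_2 = \Phi_2\Lambda_2), \tag{15}$$

the sample parameters are transformed to $(\mu_2, \Sigma_2)$, i.e.

$$\frac{1}{n}\sum_{i=1}^{n} Z_i = \Phi_2\Lambda_2^{1/2}\,\frac{1}{n}\sum_{i=1}^{n} Y_i + \mu_2 = \mu_2, \qquad
  \frac{1}{n-1}\sum_{i=1}^{n}(Z_i-\mu_2)(Z_i-\mu_2)^t
  = \Phi_2\Lambda_2^{1/2}\, I\, \Lambda_2^{1/2}\Phi_2^t
  = \Phi_2\Lambda_2\Phi_2^t = \Sigma_2, \tag{16}$$

and the population parameters of $Z$ are given by

$$E(Z) = \mu_2 - \Phi_2\Lambda_2^{1/2}\Phi_1\Lambda_1^{-1/2}\Phi_1^t\mu_1, \qquad
  V(Z) = \Phi_2\Lambda_2^{1/2}\Phi_1\Lambda_1^{-1/2}\Phi_1^t\,\Phi_1\Lambda_1^{-1/2}\Phi_1^t\Lambda_2^{1/2}\Phi_2^t
       = \Phi_2\Lambda_2^{1/2}\Sigma_1^{-1}\Lambda_2^{1/2}\Phi_2^t. \tag{17}$$

When the population mean vector $M$ is known, (17) is replaced by

$$E(Z) = M, \qquad
  V(Z) = \Phi_2\Lambda_2^{1/2}\Sigma_1^{-1}\Lambda_2^{1/2}\Phi_2^t, \qquad
  \Sigma_1 = \frac{1}{n}\sum_{i=1}^{n}(X_i-M)(X_i-M)^t. \tag{18}$$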



In the following experiments, $n_0$ is set to zero and the population is assumed
to have a known mean vector and an unknown covariance matrix.
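The known-mean form of the Bayesian sampling can be sketched in a few lines. The example below is only an illustration under simplifying assumptions that are not in the paper: it is restricted to two dimensions, the target sample covariance matrix is diagonal (so its eigenvector matrix is the identity), and the deviations of the temporary sample from the known mean are drawn from $N(0, I)$. The function names are hypothetical. Each call generates one population covariance matrix $\Lambda_2^{1/2}\Sigma_1^{-1}\Lambda_2^{1/2}$; repeating the call and testing the fixed classifier on each generated population gives the expectation of the error rate.

```python
import math
import random

def sample_scatter_2d(n, rng):
    """Sigma_1 = (1/n) sum X_i X_i^t for n deviations X_i ~ N(0, I), d = 2."""
    sxx = sxy = syy = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        sxx += x * x
        sxy += x * y
        syy += y * y
    return [[sxx / n, sxy / n], [sxy / n, syy / n]]

def bayesian_population_cov_2d(n, target_diag, rng):
    """One population covariance consistent with a fixed diagonal sample covariance.

    target_diag holds the diagonal of the (assumed diagonal) sample covariance
    Sigma_2, so Phi_2 = I and Lambda_2 = diag(target_diag).
    """
    s1 = sample_scatter_2d(n, rng)
    det = s1[0][0] * s1[1][1] - s1[0][1] ** 2
    inv = [[s1[1][1] / det, -s1[0][1] / det],
           [-s1[0][1] / det, s1[0][0] / det]]
    root = [math.sqrt(v) for v in target_diag]
    # V(Z) = Lambda_2^{1/2} Sigma_1^{-1} Lambda_2^{1/2}
    return [[root[i] * inv[i][j] * root[j] for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(bayesian_population_cov_2d(5, [1.0, 1.0], rng))
```

For a general (non-diagonal) target covariance the same construction would need an eigendecomposition of $\Sigma_2$, which is omitted here to keep the sketch dependency-free.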



Multivariate Case with Common Sample Covariance Matrix. Table 1
and Fig. 3 show the results of experiments for the multivariate case where the
sample size, the sample covariance matrices, and the a priori probabilities are
all common to the two classes. The rows sim. are the results of the Monte Carlo
simulation employing the Bayesian sampling, where the size of the test sample is
1000 and the number of iterations is 5000. The rows t show the mean error rate
given by (9).

The optimum discriminant function employed in the simulation is derived
from (4). The sample covariance matrix is the d x d identity matrix, and the
population mean vectors are

$$M_1 = (0,0,0,\ldots,0), \qquad M_2 = (1,1,1,\ldots,1). \tag{19}$$

Table 1. Mean error rate (%) vs. dimensionality in multivariate two-class problem
with common sample covariance matrices

                 n = 10        n = 15        n = 20
   d            opt.(qdf.)    opt.(qdf.)    opt.(qdf.)
   2    sim.      25.97         25.29         24.95
        t         25.96         25.28         24.95
   4    sim.      21.49         19.44         18.45
        t         21.52         19.43         18.47
   6    sim.      21.26         16.94         15.17
        t         21.30         17.04         15.28
   8    sim.      24.65         16.49         13.58
        t         24.75         16.59         13.74
  10    sim.      35.32         17.65         13.22
        t         35.24         17.80         13.29

Fig. 3. Theoretical mean error rate (%) vs. dimensionality d



For these parameters, the squared Mahalanobis distance is $\delta^2 = d$, so the
apparent squared distance $(1-(d-1)/n)\,d$ in (9) is largest, and the mean error
rate (9) is approximately smallest, when $d = (n+1)/2$.

Because the sample size and the sample covariance matrices are common to the
classes, the optimum classifier and the quadratic classifier give the same
results. The mean error rates predicted by the t-distribution agree well with
those obtained by the Monte Carlo simulation.
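The theoretical rows of Table 1 can be recomputed directly from the formula for $n_0 = 0$, namely $\varepsilon = \Psi_{n-d+1}(-\tfrac{1}{2}(1-(d-1)/n)^{1/2}\delta)$ with $\delta^2 = d$ for the mean vectors (19) and the identity sample covariance matrix. The sketch below is one way to do this with the standard library only; the helper names and the use of Simpson's rule for the t-distribution integral are choices made here, not the authors' procedure.

```python
import math

def t_pdf(x, nu):
    """Density of the univariate t-distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_cdf(x0, nu, steps=2000):
    """Psi_nu(x0): Simpson's rule on [0, |x0|] plus the symmetry of the density."""
    a = abs(x0)
    h = a / steps
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if x0 >= 0 else 0.5 - area

def mean_error_rate(n, d, delta):
    """Mean error rate for n_0 = 0, common sample size, covariance and priors."""
    shrink = math.sqrt(1.0 - (d - 1) / n)
    return t_cdf(-0.5 * shrink * delta, n - d + 1)

if __name__ == "__main__":
    for n in (10, 15, 20):
        for d in (2, 4, 6, 8, 10):
            rate = 100 * mean_error_rate(n, d, math.sqrt(d))
            print(f"n={n:2d} d={d:2d} error={rate:5.2f}%")
```

For n = 10 the formula gives values matching the table entries 25.96 (d = 2) and 35.24 (d = 10); the latter case has one degree of freedom, where the t-distribution reduces to the Cauchy distribution.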



Multivariate Case with Different Sample Covariance. Fig. 4 shows the
mean error rates of the optimum classifier and the quadratic classifier for two
classes with different sample covariance matrices. The mean error rates were
evaluated by Monte Carlo simulation employing the Bayesian sampling, where
the size of the test sample and the number of iterations are both 5000. The size
of the design sample and the a priori probabilities are common to the classes.
The sample covariance matrix of class 1 is the 8 x 8 identity matrix, and that
of class 2 is the 8 x 8 diagonal matrix with diagonal elements

$$\mathrm{diag}\,\Sigma_2 = (8.41,\ 12.06,\ 0.12,\ 0.22,\ 1.49,\ 1.77,\ 0.35,\ 2.73). \tag{20}$$

The mean vectors are given by

$$M_1 = (-1,0,0,\ldots,0), \qquad M_2 = -M_1. \tag{21}$$

The mean error rates of the quadratic classifier approach those of the
optimum classifier as the sample size n increases; however, the optimum
classifier outperforms the quadratic classifier for all sample sizes.




[Figure: mean error rate versus sample size n (0 to 40) for the quadratic
classifier and the optimum classifier]

Fig. 4. Mean error rate of quadratic classifier and optimum classifier vs. sample size
in 8-variate two-class problem with individual sample covariance matrices



6 Conclusion and Discussion 

This paper dealt with the optimum classifier design and the performance
evaluation by the Bayesian approach. To verify the optimality of the classifier
and the correctness of the mean error calculation, the results of the Monte
Carlo simulation employing the Bayesian sampling were shown. It was also shown
that the Bayesian formulas of the mean error rate have the following
characteristics.

1. The unknown population parameters are not required in their calculation.

2. The expression is simple and clearly shows the limited sample effect on the
mean error rate.

3. The relationship between the prior parameters and the mean error rate is
explicitly expressed.

In the Monte Carlo simulation, the property of the optimum classifier was
studied when $n_0$ was set to zero and the prior distribution was completely
neglected. When $n_0$ is not zero, the mean error rate is expressed by (8) and
is further minimized by selecting the optimum $n_0$ which maximizes

$$\left(1-\frac{d-1}{n+n_0}\right)(M_2-M_1)^t\left[\frac{n}{n+n_0}\,\Sigma+\frac{n_0}{n+n_0}\,\Sigma_0\right]^{-1}(M_2-M_1). \tag{22}$$

Increasing $n_0$ has an effect similar to increasing the sample size: it adds
to the degrees of freedom of the t-distribution and reduces the shrinkage of
the apparent Mahalanobis distance. Therefore complete ignorance of the prior
distribution, by setting $n_0$ to zero, does not lead to the best possible
classifier.
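To make the role of the confidence constant concrete, the quantity to be maximized, $(1-(d-1)/(n+n_0))\,(M_2-M_1)^t[\tfrac{n}{n+n_0}\Sigma+\tfrac{n_0}{n+n_0}\Sigma_0]^{-1}(M_2-M_1)$, can be evaluated for a few candidate values of $n_0$. The sketch below assumes, purely for simplicity, that both the sample covariance matrix and the initial estimate are diagonal, so that no matrix inversion is needed; the function name is hypothetical.

```python
def apparent_distance(n0, n, mean_diff, sigma_diag, sigma0_diag):
    """Objective of Eq. (22) for diagonal Sigma and Sigma_0 (an assumption).

    mean_diff   : components of M_2 - M_1
    sigma_diag  : diagonal of the sample covariance matrix Sigma
    sigma0_diag : diagonal of the initial estimate Sigma_0
    """
    d = len(mean_diff)
    w = n / (n + n0)  # weight of the sample covariance in Sigma_n
    quad = sum(m * m / (w * s + (1.0 - w) * s0)
               for m, s, s0 in zip(mean_diff, sigma_diag, sigma0_diag))
    return (1.0 - (d - 1) / (n + n0)) * quad

if __name__ == "__main__":
    diff = [1.0] * 4
    for n0 in (0, 5, 10, 20):
        print(n0, apparent_distance(n0, 10, diff, [1.0] * 4, [1.0] * 4))
```

In this toy setting the initial estimate equals the sample covariance, so the objective grows monotonically with $n_0$; with a poorer initial estimate the second factor can decrease, and a finite optimum would be expected.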

In most real-world applications, the given sample parameters are fixed and
the population parameters are unknown. The Bayesian sampling agrees better
with this reality than non-Bayesian sampling, and provides a new way of
performing Monte Carlo simulation, such as the analysis of multi-category
classification problems starting from real-world sample parameters at hand.

Abstract. For many statistical pattern recognition methods, distributions
of sample vectors are assumed to be normal, and the quadratic
discriminant function derived from the probability density function of the
multivariate normal distribution is used for classification. However, the
computational cost is $O(n^2)$ for n-dimensional vectors. Moreover, if there
are not enough training sample patterns, the covariance matrix cannot be
estimated accurately. In the case that the dimensionality is large, these
disadvantages markedly reduce classification performance. In order to
avoid these problems, in this paper, a new approximation method of the
quadratic discriminant function is proposed. This approximation is done
by replacing the values of small eigenvalues by a constant which is
estimated by the maximum likelihood estimation. This approximation not
only reduces the computational cost but also improves the classification
accuracy.

1 Introduction 

In conventional statistical pattern recognition methods, features are extracted
from objects. The features are expressed in the form of feature vectors, and the
probability density function of the distribution of feature vectors is estimated
for each category. An unknown input pattern is assigned to the category with the
maximum probability. The estimation methods of the probability density
function are classified into two types: parametric estimation and nonparametric
estimation.

In parametric density estimation, the form of the density function is
assumed to be known, and the parameters of the function are estimated using the
training sample vectors. The multivariate normal distribution is usually used as
the density function, because it is easy to handle and in many cases the
distribution of sample vectors can be regarded as normal if there are enough
samples. The mean vector and covariance matrix are calculated from the vectors.
However, if there are not enough training sample vectors, the covariance matrix
cannot be estimated accurately. The estimation errors will increase in the
eigenvalue expansion, especially for the higher dimensions. Moreover, the
computational cost will reach $O(n^2)$ for n-dimensional vectors. In the case
that the dimensionality is large, these disadvantages markedly reduce
classification performance.

On the other hand, nonparametric density estimation is used without assuming
that the form of the density function is known. Many researchers have
tried to estimate the distribution by nonparametric methods. In many cases, the
k nearest neighbor (k-NN) or Parzen kernel-type estimate is used. Fukunaga et al.
estimated the probability density function by using either k-NN or Parzen
procedures, and discussed the estimation method of the Bayes error. Furthermore,
for dimensional reduction, Buturovic used the k-NN estimate of the Bayes error in
the transformed low-dimensional space as an optimization criterion for
constructing the linear transformation. Since these methods estimate arbitrary
probability density functions which are not normal distributions, the
computation time is significant and it is difficult to find optimal parameters.

In this paper, we focus on the parametric density estimation using the
probability density function of the multivariate normal distribution. In order
to avoid the disadvantages mentioned above, a new approximation method of the
quadratic discriminant function is proposed. This approximation is done by
replacing the values of small eigenvalues by a constant which is estimated by
the maximum likelihood estimation. By applying this approximation, a new
discriminant function, called simplified quadratic discriminant function, is
defined. This function not only reduces the computational cost but also improves
the classification accuracy.

2 Approximation of the Quadratic Discriminant Function 

First we give a brief review about the quadratic discriminant function, and then 
propose a new approximation method of the function. 



2.1 Quadratic Discriminant Function 

Let n be the dimension of the feature vector. The well-known probability density
function of the n-dimensional normal distribution is

$$p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^t\Sigma^{-1}(x-\mu)\right\}, \tag{1}$$

where $x$ is an n-component vector, $\mu$ is the mean vector, and $\Sigma$ is the
n x n covariance matrix. The quadratic discriminant function (QDF) is derived
from Eq. (1) as follows.

$$g(x) = (x-\mu)^t\Sigma^{-1}(x-\mu) + \log|\Sigma| \tag{2}$$

$$\phantom{g(x)} = \sum_{i=1}^{n}\frac{\left((x-\mu)^t\phi_i\right)^2}{\lambda_i} + \sum_{i=1}^{n}\log\lambda_i, \tag{3}$$

where $\lambda_i$ is the ith eigenvalue of $\Sigma$ sorted in descending order,
and $\phi_i$ is the eigenvector that corresponds to $\lambda_i$. This is the
minimum-error-rate classifier if the distributions are normal, the prior
probabilities of all categories are equal, and the parameters $\mu$ and
$\Sigma$ are known. In general, however, the parameters are unknown, so the
sample mean vector $\hat\mu$ and the sample covariance matrix $\hat\Sigma$ are
used.

$$\hat g(x) = (x-\hat\mu)^t\hat\Sigma^{-1}(x-\hat\mu) + \log|\hat\Sigma| \tag{4}$$

$$\phantom{\hat g(x)} = \sum_{i=1}^{n}\frac{\left((x-\hat\mu)^t\hat\phi_i\right)^2}{\hat\lambda_i} + \sum_{i=1}^{n}\log\hat\lambda_i. \tag{5}$$

Here, $\hat\lambda_i$ is the ith eigenvalue of $\hat\Sigma$ and $\hat\phi_i$ is
the corresponding eigenvector. It is known that small eigenvalues in Eq. (5)
usually contain large errors that reduce the recognition accuracy. Moreover,
the computational cost of Eq. (5) is $O(n^2)$ for n-dimensional vectors. When n
is large, this computational cost is enormous.



2.2 Simplified Quadratic Discriminant Function 

To avoid the bad influence caused by small eigenvalues and to reduce the
computational cost, one possible solution is to replace the small eigenvalues
by a constant. Eq. (5) is approximated by the following function.

$$g_s(x) = \sum_{i=1}^{k}\frac{\left((x-\hat\mu)^t\hat\phi_i\right)^2}{\hat\lambda_i}
         + \sum_{i=k+1}^{n}\frac{\left((x-\hat\mu)^t\hat\phi_i\right)^2}{\lambda}
         + \sum_{i=1}^{k}\log\hat\lambda_i
         + \sum_{i=k+1}^{n}\log\lambda. \tag{6}$$

Here, $\lambda$ is a constant and $k < n$. Eq. (6) is called the simplified
quadratic discriminant function, or SQDF. In the case of $k = n$, SQDF is the
same as QDF.

The value of $\lambda$ is determined by the maximum likelihood estimation. For
simplicity, the first and third terms of Eq. (6) are fixed, and the second and
fourth terms are considered. In other words, the maximum likelihood estimation
is performed in the $(n-k)$-dimensional subspace spanned by
$\{\hat\phi_{k+1}, \ldots, \hat\phi_n\}$.

Replacing the small eigenvalues with $\lambda$ means that the variance on each
axis in this subspace is assumed to be $\lambda$. We define

$$y = (y_{k+1}, \ldots, y_n), \tag{7}$$

where

$$y_i = (x-\hat\mu)^t\hat\phi_i. \tag{8}$$

Since the variance of $y_i$ is assumed to be $\lambda$, the probability density
function of $y$ is

$$p(y) = \frac{1}{(2\pi\lambda)^{(n-k)/2}}\exp\left(-\sum_{i=k+1}^{n}\frac{y_i^2}{2\lambda}\right). \tag{9}$$

Note that $y_i$ and $y$ are random variables. Let m be the number of samples and
$y_{ij}$ $(1 \le j \le m)$ be the jth observation value of $y_i$. The likelihood
of $\lambda$ is

$$L = \frac{1}{(2\pi\lambda)^{m(n-k)/2}}\exp\left(-\frac{1}{2\lambda}\sum_{i=k+1}^{n}\sum_{j=1}^{m}y_{ij}^2\right). \tag{10}$$

Solving the equation

$$\frac{\partial}{\partial\lambda}\log L = 0, \tag{11}$$

we get

$$\lambda = \frac{1}{n-k}\sum_{i=k+1}^{n}\frac{1}{m}\sum_{j=1}^{m}y_{ij}^2
         = \frac{1}{n-k}\sum_{i=k+1}^{n}\hat\lambda_i. \tag{12}$$



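In other words, $\lambda$ is the mean value of $\hat\lambda_i$
$(i = k+1, \ldots, n)$. Since $\mathrm{tr}\,\hat\Sigma = \sum_{i=1}^{n}\hat\lambda_i$
and $\|x-\hat\mu\|^2 = \sum_{i=1}^{n}\left((x-\hat\mu)^t\hat\phi_i\right)^2$,
Eq. (6) can be rewritten as

$$g_s(x) = \sum_{i=1}^{k}\frac{\lambda-\hat\lambda_i}{\lambda\hat\lambda_i}\left((x-\hat\mu)^t\hat\phi_i\right)^2
         + \frac{\|x-\hat\mu\|^2}{\lambda}
         + \sum_{i=1}^{k}\log\hat\lambda_i + (n-k)\log\lambda, \tag{13}$$

where

$$\lambda = \frac{\mathrm{tr}\,\hat\Sigma - \sum_{i=1}^{k}\hat\lambda_i}{n-k}, \tag{14}$$

which can be calculated with k eigenvectors and k eigenvalues. Compared with
Eq. (5), the computational cost of Eq. (13) is reduced from $O(n^2)$ to $O(nk)$.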
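The equivalence between the four-term definition of SQDF and its rewritten form that needs only the k leading eigenpairs and the trace of the sample covariance matrix can be checked numerically. The sketch below is an illustration only: the eigenvalues and orthonormal eigenvectors are assumed to be supplied by some eigensolver (none is implemented here), and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tail_eigenvalue(trace, lead_eigvals, n):
    """lambda = (tr(Sigma) - sum of the k leading eigenvalues) / (n - k)."""
    return (trace - sum(lead_eigvals)) / (n - len(lead_eigvals))

def sqdf_direct(x, mean, eigvals, eigvecs, k):
    """Four-term form: all n eigenpairs, the n-k smallest eigenvalues replaced."""
    n = len(x)
    diff = [a - b for a, b in zip(x, mean)]
    lam = sum(eigvals[k:]) / (n - k)  # mean of the discarded eigenvalues
    total = 0.0
    for i in range(n):
        scale = eigvals[i] if i < k else lam
        total += dot(diff, eigvecs[i]) ** 2 / scale + math.log(scale)
    return total

def sqdf_fast(x, mean, lead_eigvals, lead_eigvecs, trace):
    """Rewritten form: k eigenpairs and the trace only, O(nk) per pattern."""
    n = len(x)
    k = len(lead_eigvals)
    diff = [a - b for a, b in zip(x, mean)]
    lam = tail_eigenvalue(trace, lead_eigvals, n)
    total = dot(diff, diff) / lam + (n - k) * math.log(lam)
    for lam_i, phi_i in zip(lead_eigvals, lead_eigvecs):
        total += ((lam - lam_i) / (lam * lam_i) * dot(diff, phi_i) ** 2
                  + math.log(lam_i))
    return total

if __name__ == "__main__":
    # Toy 3-dimensional example with an orthonormal basis (a rotated frame).
    phi = [[0.6, 0.8, 0.0], [-0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
    lam = [4.0, 1.0, 0.5]  # eigenvalues in descending order
    x, mean = [1.0, -2.0, 0.5], [0.0, 0.5, 0.0]
    print(sqdf_direct(x, mean, lam, phi, 1))
    print(sqdf_fast(x, mean, lam[:1], phi[:1], sum(lam)))
```

Both calls should agree up to rounding, since the second form only regroups the terms of the first using the trace and the squared norm of the centered pattern.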

Next, we investigate the form of the density function of Eq. (5) and Eq. (6).
Only the first term of Eq. (5) and the first and second terms of Eq. (6) are
considered, because the other terms are just normalizing terms. Let $e_1$,
$e_2$, $e_3$ be the expected values of the first term of Eq. (5), the first term
of Eq. (6), and the second term of Eq. (6), respectively. For simplicity, the
case that there are enough samples is considered. Since
$(x-\hat\mu)^t\hat\phi_i/\sqrt{\hat\lambda_i}$ follows the normal distribution
$N(0, 1)$, the first term of Eq. (5) follows the $\chi^2$ distribution with n
degrees of freedom. Then,

$$e_1 = n.$$

In the case that there are enough samples, the first term of Eq. (6) follows
the $\chi^2$ distribution with k degrees of freedom, so

$$e_2 = k.$$



A New Approximation Method of the Quadratic Discriminant Function 605 



Since $\hat\lambda_i$ represents the variance of the component projected onto
the vector $\hat\phi_i$, the expected value of
$\left((x-\hat\mu)^t\hat\phi_i\right)^2$ is $\hat\lambda_i$. Then the expected
value of the second term of Eq. (6) is

$$e_3 = \frac{1}{\lambda}\sum_{i=k+1}^{n}\hat\lambda_i.$$

Substituting Eq. (12),

$$e_3 = n - k$$

is obtained. Therefore we get

$$e_1 = e_2 + e_3,$$

namely, the expectation values of the two expressions are equal. This means
Eq. (6) gives a good approximation of Eq. (5).

As related approaches, the quasi-Mahalanobis distance (QMD) and the modified
Mahalanobis distance (MMD) have been proposed. The QMD neglects
the third and fourth terms of Eq. (6) and replaces $\lambda$ by
$\hat\lambda_{k+1}$. The MMD only uses the first term of Eq. (5), and instead of
$\hat\lambda_i$, $\hat\lambda_i + b$ is employed, where b is a bias determined
experimentally. The modified quadratic discriminant function (MQDF) is derived
from the Bayesian estimation of the covariance matrix, and $\hat\lambda_i$ in
the first and third terms of Eq. (6) is replaced by $\hat\lambda_i + \lambda$.
The value of $\lambda$ is determined experimentally. All of these methods have
been proposed to improve recognition accuracy but not to approximate the
quadratic discriminant function. SQDF is an approximation of the quadratic
discriminant function, and can describe the form of the distribution. Moreover,
since SQDF is derived from the maximum likelihood estimation, it is not only
appropriate as a classifier, but can also be used for model complexity
identification with an information criterion such as Akaike’s Information
Criterion (AIC) or Minimum Description Length (MDL).

2.3 Model Identification 

In SQDF, the only parameter that is not determined is k, that is, the number 
of reliable eigenvalues. The other parameters are calculated automatically with 
samples. Of course, k can be chosen arbitrarily or experimentally. In recognition 
systems which handle large number of categories with high dimensional vectors, 
small value of k should be chosen in order to limit the computational cost. 

However, if we attach greater importance to the form of the distribution, the
value of k can be determined by an information criterion. This is a kind of
model identification. Let the numbers of parameters of the mean vector, the
eigenvalues and the eigenvectors be $N_1$, $N_2$ and $N_3$, respectively. Since
the mean vector is n-dimensional, $N_1 = n$. Since the number of eigenvalues is
$k + 1$ $(k < n)$ or $k$ $(k = n)$, $N_2 = \min(k+1, n)$. The number of
parameters of the eigenvectors is

$$N_3 = (n-1) + (n-2) + \cdots + (n-k) = \frac{2kn - k^2 - k}{2}. \tag{15}$$

The total number of parameters of SQDF is

$$N_1 + N_2 + N_3 = \frac{(2n-k)(k+1) + 2\min(k+1, n)}{2}. \tag{16}$$

Let $x_j$ be a sample $(j = 1, 2, \ldots, m)$. The AIC is written as

$$\mathrm{AIC} = \sum_{j=1}^{m} g_s(x_j) + (2n-k)(k+1) + 2\min(k+1, n), \tag{17}$$

while the MDL is written as

$$\mathrm{MDL} = \sum_{j=1}^{m} g_s(x_j) + \frac{(2n-k)(k+1) + 2\min(k+1, n)}{2}\,\log m. \tag{18}$$

The value of k is determined to minimize the criterion AIC or MDL.
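The parameter count and the two criteria are simple to compute once the sum of $g_s(x_j)$ over the training samples is available for each candidate k. The sketch below assumes the usual conventions, that is, an AIC penalty of twice the number of parameters and an MDL penalty of the number of parameters times $\log m$; the function names and the dictionary-based interface are hypothetical.

```python
import math

def num_parameters(n, k):
    """N1 + N2 + N3 for an SQDF that keeps k of the n eigenpairs."""
    n1 = n                             # mean vector
    n2 = min(k + 1, n)                 # k eigenvalues plus the constant lambda
    n3 = (2 * k * n - k * k - k) // 2  # (n-1) + (n-2) + ... + (n-k)
    return n1 + n2 + n3

def aic(sum_gs, n, k):
    """Sum of g_s over the samples plus twice the number of parameters."""
    return sum_gs + 2 * num_parameters(n, k)

def mdl(sum_gs, n, k, m):
    """Sum of g_s over the samples plus (number of parameters) * log m."""
    return sum_gs + num_parameters(n, k) * math.log(m)

def best_k(sum_gs_by_k, n, m=None):
    """Return the k minimizing AIC (m is None) or MDL (m given)."""
    def score(k):
        s = sum_gs_by_k[k]
        return aic(s, n, k) if m is None else mdl(s, n, k, m)
    return min(sum_gs_by_k, key=score)

if __name__ == "__main__":
    # Made-up sums of g_s, only to show the calling convention.
    sums = {4: 500.0, 8: 300.0, 12: 295.0}
    print(best_k(sums, n=16), best_k(sums, n=16, m=10000))
```

Because $\log m$ exceeds 2 for all but the smallest samples, MDL penalizes additional eigenpairs more heavily than AIC and therefore tends to select a smaller k.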



3 Experiments 

In order to confirm the effectiveness of SQDF, three types of experiments are 
carried out. 

3.1 Effectiveness as a Classifier 

The first experiment is done to confirm the effectiveness of SQDF as a classifier.
Character recognition is performed using character images included in the NIST
Special Database 19. The database includes over 800000 handprinted digit
and alphabetic character images. Digit character images of ‘0’ and ‘1’ are used
in the experiment. The numbers of samples of ‘0’ and ‘1’ are both 40000. As
the feature vector, the improved directional element feature is used. This
feature is a 196-dimensional vector.

For each category, m images out of the first 10000 images are used as training
sample data, and the next 10000 images are used for evaluation. From the
training sample data, feature vectors are extracted, and mean vectors and
covariance matrices are calculated. Then SQDF and QDF are used as discriminant
functions. The results are shown in Fig. 1. Fig. 1(a) shows the error rates for
various dimensionalities k of SQDF. The number of training samples is fixed to
m = 10000. Here, SQDF with k = 196 is equal to QDF. The figure shows that the
error rate of SQDF with k = 30 is much smaller than that with k = 196, that is,
the result of QDF. Fig. 1(b) shows the error rates for various numbers of
training samples. The dimensionality is fixed to k = 30. These results show that
SQDF is much more effective than QDF if the number of samples is small. In the
case of m < 2000, the error rate of QDF becomes extremely large, whereas the
error rate of SQDF changes little.

All of these results show that SQDF not only reduces the computational
time but also improves classification accuracy. It is especially effective in
the case of a small number of training samples.

[Figure: (a) error rate versus dimensionality k; (b) error rate versus number of
samples m]

Fig. 1. Results of character recognition, (a) Error rates of various dimensionality, m =
10000. (b) Error rates of various number of samples, k = 30.



3.2 Validity of Approximation 

Next, in order to confirm the validity of the approximation, an experiment using
artificial data is carried out. Since SQDF (Eq. (6)) is supposed to approximate
QDF, the difference between the two should be small.

Suppose $\mu_0$ is an n-dimensional vector, and $\Sigma_0$ is an appropriate
n x n covariance matrix. Here, we use the mean vector and the covariance matrix
which are calculated with 10000 character images of ‘0’ in Section 3.1. Since
the improved directional element feature is adopted, n = 196. By producing
random numbers, m training vectors that follow the n-dimensional normal
distribution $N(\mu_0, \Sigma_0)$ are produced. The sample covariance matrix
$\hat\Sigma$ and the sample mean vector $\hat\mu$ are calculated with the m
training vectors. Another 10000 vectors that follow $N(\mu_0, \Sigma_0)$ are
randomly generated as evaluation vectors. The value of Eq. (6) (SQDF) for each
evaluation vector is computed with $\hat\mu$ and $\hat\Sigma$. Suppose the value
$g_{true}$ obtained by Eq. (2) with $\Sigma_0$ and $\mu_0$ is the true value of
QDF. The error e is given as $e = |(g_s - g_{true})/g_{true}|$. The average of e
over the evaluation vectors is calculated. The error of QDF is calculated in the
same manner.

Fig. 2 shows the errors of SQDF and QDF computed with various m. Note
that QDF can be calculated only if m > n, whereas SQDF can be calculated
even in the case of m < n if k < m. In all cases, the larger m is, the closer
the estimated value is to the true value. In the case that the number of
training samples is small, the error of SQDF is smaller than that of QDF. In the
case of m = 1000, where the number of samples is about five times larger than
the dimensionality, the error of SQDF becomes slightly larger than that of QDF.
However, the large difference in computational time between SQDF and QDF still
makes SQDF attractive.

Fig. 2. Errors of each discriminant function.



3.3 Validity of Model Identification Method 

The third experiment is carried out to confirm the validity of model identification 
method described in Section ITTH Suppose is an appropriate nxn covariance 
matrix, and fii is an n-dimensional mean vector. Here, we consider the following 
diagonal matrix and vector. 

= diag(l, 1, 1, ..., 1, 2, 3, 4, ..., n/2 + 1), (19) 

' V ^ 

n/2 n/2 

Mi = (0,0,...,0)‘. (20) 

Si consists of n/2 components those values are 1 and n/2 components those va- 
lues are larger than 1. Because Si is a diagonal matrix, each diagonal component 
corresponds to eigenvalue. In this section, the dimensionality is n = 16. 

m training vectors that follow the n-dimensional normal distribution $N(\mu_1, \Sigma_1)$ are produced in the same manner as in the previous experiment. The sample covariance matrix $\hat{\Sigma}$ and the sample mean vector $\hat{\mu}$ are calculated from the training vectors. Then the value of k is determined by AIC or MDL as described earlier. In this case, the number of small eigenvalues that can be regarded as constant is n/2. Since SQDF regards k small eigenvalues as constant, the value of k is expected to be determined as k = n/2.

Figs. 3(a) and (b) show the values of AIC and MDL, respectively, in the case of m = 10000. Both criteria become small if k ≥ 8 (= n/2). MDL is smallest when the value of k is 8 (= n/2), while AIC is smallest when the value of k is 10. Fig. 3(c) shows the relationship between the number of samples m and the selected value of k. These results show that an appropriate value can be selected by MDL.



A New Approximation Method of the Quadratic Discriminant Function




Fig. 3. Model identification by AIC and MDL. (a) The values of AIC in the case of m = 
10000. (b) The values of MDL in the case of m = 10000. (c) Selected dimensionality 
by AIC and MDL. 



4 Conclusions 

In this paper, we have focused on parametric density estimation using the probability density function of the multivariate normal distribution. In order to avoid the disadvantages of the quadratic discriminant function, we have proposed a new method for approximating it. The approximation replaces the small eigenvalues by a constant that is estimated by maximum likelihood estimation. By applying this approximation, a new discriminant function, the simplified quadratic discriminant function (SQDF), has been defined. This function not only reduces the computational cost but also improves the classification accuracy.

In order to clarify the effectiveness of SQDF, three types of experiments have 
been carried out. Experimental results of classification using character images 










S. Omachi, F. Sun, and H. Aso 



have shown that SQDF not only reduces the computational time but also improves classification accuracy. The second experiment has shown that SQDF gives a good approximation of QDF. The results of the third experiment have shown that the parameter of SQDF can be determined by an information criterion.

Applying SQDF to various pattern recognition problems is left for future work.
Abstract. Binary classifiers are used in many complex classification problems in which the classification result could have serious consequences. Thus, they should ensure very high reliability to avoid erroneous decisions. Unfortunately, this is rarely the case in real situations, where the cost of a wrong classification can be so high that it is preferable to reject a sample that gives rise to an unreliable result. However, as far as we know, a reject option specifically devised for binary classifiers has not yet been proposed. This paper presents an optimal reject rule for binary classifiers, based on the Receiver Operating Characteristic curve. The rule is optimal in that it maximizes a classification utility function, defined on the basis of the classification and error costs specific to the application at hand. Experiments performed with a publicly available data set confirmed the effectiveness of the proposed reject rule.



1 Introduction 

Many complex classification problems involve binary decisions, since they require a choice between two possible, alternative classes. Applications such as automated cancer diagnosis, currency verification, speaker identification, and fraud detection fall into this category. Their common feature is that the classification result could have serious consequences: for this reason, the classifiers with binary outcomes (in short, binary classifiers) used in these situations should ensure very high reliability to avoid erroneous decisions. Unfortunately, in the real world this is rarely the case because, when working on real data, the classifiers could easily encounter samples very different from those learned during the training phase. In these cases, the cost of a wrong classification could be so high that it is preferable to suspend the decision and call for a further test, i.e. to reject the sample. Obviously, such a reject option should be defined by taking into account the requirements of the given application domain.

This topic has been addressed with reference to multi-class classifiers by Chow in [1,2]. The rationale of Chow's approach relies on exact knowledge of the a posteriori probabilities for each sample to be recognized. Under this hypothesis, Chow's rule is optimal because it minimizes the error rate for a given reject rate (or vice versa). However, full knowledge of the class distributions is extremely difficult to obtain in real cases, and thus Chow's rule is rarely applicable "as it is". Extensions of Chow's rule to the case in which the a priori knowledge about the classes is not complete have been proposed, as has a reject option that does not require any a priori knowledge, defined with reference to a Multi-Layer Perceptron. Although effective, these rules are applicable only with multi-class classifiers. As far as we know, a reject option specifically devised for binary classifiers has not yet been proposed.

The aim of this paper is to introduce an optimal reject rule for binary classifiers, based on the Receiver Operating Characteristic curve (ROC curve). ROC analysis is based on statistical decision theory and was first employed in signal detection problems. It is now common in medical diagnosis and particularly in medical imaging. Recently, it has been employed in Statistical Pattern Recognition for evaluating machine learning algorithms and for robust comparison of classifier performance under imprecise class distributions and misclassification costs.

In the method presented here, the information about the classifier performance provided by the ROC curve is employed to build an optimal reject rule. The rule is optimal in that it maximizes a classification utility function U(.), defined on the basis of the classification and error costs specific to the application at hand. Experiments performed with a publicly available data set confirmed the effectiveness of the proposed reject rule.



2 ROC Curve 

In binary classification problems, a sample can be assigned to one of two mutually exclusive classes that can be generically called the Positive (P) class and the Negative (N) class. Let us assume that the classifier provides, for each sample, a value x in the range [0,1] which is a confidence degree that the sample belongs to one of the two classes, e.g. the class P. The sample should consequently be assigned to the class N if x → 0 and to the class P if x → 1. In practice, a confidence threshold t is usually chosen, so as to attribute the sample to the class N if x < t and to the class P if x > t. For a given threshold value t, some indices can be evaluated for measuring the performance of the classifier. In particular, the set of samples whose confidence degree is greater than t contains actually-positive samples correctly classified as "positive" and actually-negative samples incorrectly classified as "positive". It is thus possible to define the True Positive Rate TPR(t) as the fraction of actually-positive cases correctly classified and the False Positive Rate FPR(t) as the fraction of actually-negative cases incorrectly classified as "positive".

If $f_N(x)$ and $f_P(x)$ are the density functions of the confidence degree for the class N and for the class P, respectively, TPR(t) and FPR(t) are given by (see fig. 1):

$$TPR(t) = \int_t^1 f_P(x)\,dx \qquad FPR(t) = \int_t^1 f_N(x)\,dx \qquad (1)$$
In a similar way it is possible to evaluate (taking into account the samples with confidence degree less than t) the True Negative Rate TNR(t) and the False Negative Rate FNR(t), defined as:

$$TNR(t) = \int_0^t f_N(x)\,dx = 1 - FPR(t) \qquad FNR(t) = \int_0^t f_P(x)\,dx = 1 - TPR(t) \qquad (2)$$

Since the four indices are not independent, as it is possible to note from eq. (2), the 
pair (FPR(t),TPR(t)) is sufficient to completely characterize the performance of the 
classifier when the decision threshold is set to t. Fig. 1 shows how these quantities can 
be represented on a plane having FPR on the X axis and TPR on the Y axis. 




Fig. 1. The indices TPR, FPR, TNR and FNR evaluated on two bell-shaped confidence densities 
(left). The same quantities mapped on a (FPR, TPR) plane (right). 

When the value of the threshold t varies between 0 and 1 the quantities in eq. (1) and 
eq. (2) vary accordingly, thus defining a set of operating points for the classifier, given 
by the pairs (FPR(t),TPR(t)). The plot of such points gives the ROC curve of the clas- 
sifier (see fig. 2). 
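
As a minimal sketch of how these operating points can be estimated in practice, the integrals in eq. (1) and eq. (2) may be replaced by empirical fractions computed on a labeled sample. The function names, the toy scores and the tie convention (a sample is called positive only when x > t) are assumptions made for illustration and are not part of the method described here.

```python
def rates(scores, labels, t):
    """Empirical TPR, FPR, TNR and FNR at threshold t.

    scores: confidence degrees in [0, 1] that each sample is positive.
    labels: 1 for actually-positive samples, 0 for actually-negative ones.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    tpr = sum(s > t for s in pos) / len(pos)  # positives called "positive"
    fpr = sum(s > t for s in neg) / len(neg)  # negatives called "positive"
    # TNR and FNR follow from eq. (2).
    return {"TPR": tpr, "FPR": fpr, "TNR": 1.0 - fpr, "FNR": 1.0 - tpr}


def roc_points(scores, labels, thresholds):
    """List of operating points (FPR(t), TPR(t)), one per threshold."""
    points = []
    for t in thresholds:
        r = rates(scores, labels, t)
        points.append((r["FPR"], r["TPR"]))
    return points


# Toy data, purely illustrative.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(roc_points(scores, labels, [0.0, 0.5, 1.0]))
```

Sweeping t over a fine grid (or over the sorted distinct scores) gives an empirical counterpart of the ROC curve.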







Fig. 2. A typical ROC curve. 
It is worth noting that, when t approaches 0, both TPR(t) and FPR(t) approach 1, while the contrary happens when t → 1. Informally, the nearer the curve is to the upper left corner of the diagram, the better the performance obtained (higher TPR and/or lower FPR). An important reference is given by the line joining the points (0,0) and (1,1), which represents the case of a random guessing classifier.



3 The Reject Option 



When a classifier is used in a real application, each of its outcomes has a consequence with which a benefit (in the case of success) or a loss (in the case of error) is associated. Thus, the effectiveness of the classifier in a given domain should be measured on the basis of both its absolute performance (correct classification rate and error rate) and the costs associated with the various outcomes. In the case of binary classes, such costs can be organized in the cost matrix shown in table 1.



Table 1. Cost matrix for a two-class problem

                   Guess Class
True Class         N          P
N                  CTN        CFP
P                  CFN        CTP



In the cost matrix, CTN and CTP are > 0 since they are related to benefits, while CFN and CFP are < 0 since they are related to losses.

In general, the cost matrix is not symmetrical, because the consequences of different errors are usually not equivalent. As an example, in the case of medical diagnosis a false negative outcome is much more costly than a false positive. Likewise, if the disease is rare, a true positive outcome might be valued much more highly than a true negative outcome. Once the cost matrix has been established on the basis of the particular application requirements, it is possible to define a classification utility function U(t) which measures, for a given decision threshold, the effectiveness provided by the binary classifier:

$$U(t) = p(P) \cdot [CTP \cdot TPR(t) + CFN \cdot FNR(t)] + p(N) \cdot [CTN \cdot TNR(t) + CFP \cdot FPR(t)] \qquad (3)$$

where p(P) and p(N) are the a priori probabilities of the positive and negative classes, respectively. In this way, the optimal decision threshold can be determined as:

$$t_{opt} = \arg\max_t U(t) \qquad (4)$$
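
The selection of the threshold can be illustrated with a short sketch that evaluates U(t) at a finite set of operating points and keeps the best one. The operating points used below are the ROC convex hull vertices of table 2 and the costs are those of combination e in table 4; taking the a priori probabilities from the class counts of the whole dataset (268 positive cases out of 768) is an assumption of this example, as are the function and variable names.

```python
# Operating points (t, FPR, TPR): the ROC convex hull vertices of table 2.
HULL_POINTS = [
    (1.00, 0.00, 0.00), (0.90, 0.02, 0.27), (0.70, 0.10, 0.50),
    (0.30, 0.18, 0.65), (0.25, 0.24, 0.73), (0.20, 0.38, 0.88),
    (0.10, 0.58, 0.96), (0.00, 1.00, 1.00),
]
# Cost combination e of table 4 (benefits positive, losses negative).
COSTS_E = {"CTP": 50.0, "CTN": 25.0, "CFN": -100.0, "CFP": -50.0}
# Assumption: priors estimated from the class counts of the whole dataset.
P_POS = 268 / 768


def utility(tpr, fpr, p_pos, costs):
    """Classification utility U(t) at the operating point (FPR(t), TPR(t))."""
    p_neg = 1.0 - p_pos
    fnr, tnr = 1.0 - tpr, 1.0 - fpr  # relations of eq. (2)
    return (p_pos * (costs["CTP"] * tpr + costs["CFN"] * fnr)
            + p_neg * (costs["CTN"] * tnr + costs["CFP"] * fpr))


def best_threshold(points, p_pos, costs):
    """Return (t_opt, U(t_opt)) over a list of (t, FPR, TPR) triples."""
    scored = [(t, utility(tpr, fpr, p_pos, costs)) for t, fpr, tpr in points]
    return max(scored, key=lambda pair: pair[1])


t_opt, u_opt = best_threshold(HULL_POINTS, P_POS, COSTS_E)
print(t_opt, u_opt)
```

Under these assumptions the sketch selects t = 0.20, i.e. the vertex with FPR = 0.38 and TPR = 0.88.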

However, there can be real situations in which the cost of an error is so high that it is advisable to suspend the decision and reject the sample if the outcome is considered unreliable. The rejection also involves a negative cost (indicated with CR), which is related to the price of a new classification with another system and is smaller in magnitude than the error costs.

To implement the reject option in a binary classifier, the decision rule for a generic sample with confidence degree x should be changed into:

    assign the sample to N    if $x < t_1$
    assign the sample to P    if $x > t_2$                (5)
    reject the sample         if $t_1 \le x \le t_2$

where $t_1$ and $t_2$ are two decision thresholds (with $t_1 < t_2$) fixed so as to maximize the utility function.

As a consequence, the rates defined in eq. (1) and eq. (2) are modified into:

$$TPR(t_2) = \int_{t_2}^{+\infty} f_P(x)\,dx \qquad FPR(t_2) = \int_{t_2}^{+\infty} f_N(x)\,dx$$

$$TNR(t_1) = \int_{-\infty}^{t_1} f_N(x)\,dx \qquad FNR(t_1) = \int_{-\infty}^{t_1} f_P(x)\,dx \qquad (6)$$

while the reject rates relative to negative samples, $RN(t_1,t_2)$, and to positive samples, $RP(t_1,t_2)$, are given by:

$$RN(t_1,t_2) = \int_{t_1}^{t_2} f_N(x)\,dx = 1 - TNR(t_1) - FPR(t_2)$$

$$RP(t_1,t_2) = \int_{t_1}^{t_2} f_P(x)\,dx = 1 - TPR(t_2) - FNR(t_1) \qquad (7)$$

Accordingly, the utility function becomes:

$$U(t_1,t_2) = p(P) \cdot CFN \cdot FNR(t_1) + p(N) \cdot CTN \cdot TNR(t_1) + p(P) \cdot CTP \cdot TPR(t_2) + p(N) \cdot CFP \cdot FPR(t_2) + p(P) \cdot CR \cdot RP(t_1,t_2) + p(N) \cdot CR \cdot RN(t_1,t_2) \qquad (8)$$

If we take into account the relations given in eq. (7), the utility function can be written as:

$$U(t_1,t_2) = U_1(t_1) + U_2(t_2) + CR \qquad (9)$$

where:

$$U_1(t_1) = p(P) \cdot CFN' \cdot FNR(t_1) + p(N) \cdot CTN' \cdot TNR(t_1) \qquad (10)$$

$$U_2(t_2) = p(P) \cdot CTP' \cdot TPR(t_2) + p(N) \cdot CFP' \cdot FPR(t_2) \qquad (11)$$

and

$$CTP' = CTP - CR \qquad CFN' = CFN - CR \qquad CTN' = CTN - CR \qquad CFP' = CFP - CR$$
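
The step from eq. (8) to eq. (9) uses only the reject-rate relations of eq. (7) and the fact that p(P) + p(N) = 1. The following sketch checks the identity numerically for one choice of rates and costs; the rates are read from two hull vertices of table 2, the costs are those of combination e, and all names as well as the prior 268/768 are illustrative assumptions.

```python
def utility_with_reject(r, p_pos, c):
    """Utility of eq. (8); r holds TPR, FPR at t2 and TNR, FNR at t1."""
    p_neg = 1.0 - p_pos
    rp = 1.0 - r["TPR"] - r["FNR"]  # reject rate on positives, eq. (7)
    rn = 1.0 - r["TNR"] - r["FPR"]  # reject rate on negatives, eq. (7)
    return (p_pos * c["CFN"] * r["FNR"] + p_neg * c["CTN"] * r["TNR"]
            + p_pos * c["CTP"] * r["TPR"] + p_neg * c["CFP"] * r["FPR"]
            + p_pos * c["CR"] * rp + p_neg * c["CR"] * rn)


def utility_decomposed(r, p_pos, c):
    """Utility written as U1(t1) + U2(t2) + CR with the primed costs."""
    p_neg = 1.0 - p_pos
    u1 = (p_pos * (c["CFN"] - c["CR"]) * r["FNR"]
          + p_neg * (c["CTN"] - c["CR"]) * r["TNR"])
    u2 = (p_pos * (c["CTP"] - c["CR"]) * r["TPR"]
          + p_neg * (c["CFP"] - c["CR"]) * r["FPR"])
    return u1 + u2 + c["CR"]


# Illustrative inputs: t1 at the vertex (0.38, 0.88), t2 at (0.24, 0.73).
RATES = {"FNR": 1 - 0.88, "TNR": 1 - 0.38, "TPR": 0.73, "FPR": 0.24}
COSTS = {"CTP": 50.0, "CTN": 25.0, "CFN": -100.0, "CFP": -50.0, "CR": -12.5}
P_POS = 268 / 768  # assumption: prior from the dataset class counts

print(utility_with_reject(RATES, P_POS, COSTS))
print(utility_decomposed(RATES, P_POS, COSTS))
```

The two printed values agree up to floating-point rounding, which is what allows the two thresholds to be optimized separately.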

In this way, the optimal thresholds $(t_{1opt}, t_{2opt})$ which maximize $U(t_1,t_2)$ can be separately evaluated by maximizing $U_1(t_1)$ and $U_2(t_2)$:

$$t_{1opt} = \arg\max_t \; p(P) \cdot CFN' \cdot FNR(t) + p(N) \cdot CTN' \cdot TNR(t) \qquad (12)$$

$$t_{2opt} = \arg\max_t \; p(P) \cdot CTP' \cdot TPR(t) + p(N) \cdot CFP' \cdot FPR(t) \qquad (13)$$

By taking into account the relations introduced in eq. (2), the optimization problem in eq. (12) is equivalent to:

$$t_{1opt} = \arg\min_t \; p(P) \cdot CFN' \cdot TPR(t) + p(N) \cdot CTN' \cdot FPR(t) \qquad (14)$$

It is worth noting that the objective functions in eq. (14) and eq. (13) define on the ROC plane two sets of level curves having parametric equations:

$$p(P) \cdot CFN' \cdot TPR(t) + p(N) \cdot CTN' \cdot FPR(t) = k_1 \qquad (15)$$

and

$$p(P) \cdot CTP' \cdot TPR(t) + p(N) \cdot CFP' \cdot FPR(t) = k_2 \qquad (16)$$

Each set is composed of parallel straight lines. Solving each equation for TPR, the slopes associated with the sets are, respectively:

$$m_1 = -\frac{p(N) \cdot CTN'}{p(P) \cdot CFN'} \qquad m_2 = -\frac{p(N) \cdot CFP'}{p(P) \cdot CTP'} \qquad (17)$$

Since the set of feasible points for both the objective functions is given by the ROC curve, the optimal threshold $t_{1opt}$ can be determined by searching for the point on the ROC curve that also belongs to the line defined by eq. (15) which intersects the ROC and has minimum $k_1$. In a similar way $t_{2opt}$ can be found, with the only difference that we must consider the line that satisfies eq. (16), intersects the ROC curve and has maximum $k_2$. It can be simply shown that, in both cases, the line sought is the level curve that intersects the ROC and has the largest TPR-intercept. Such a line lies on the ROC Convex Hull, i.e. the convex hull of the set of points belonging to the ROC curve (see fig. 3).

In particular, the line could share with the ROC convex hull only one point (a ver- 
tex of the convex hull) or an entire edge. In the first case, the optimal threshold is 
given by the value of t associated to the point. In the second case, either of the two 
vertices of the segment can be chosen; the only difference is that the left vertex will 
have lower TPR and FPR, while the right vertex will have higher TPR and FPR. 

Fig. 3. A ROC curve with its convex hull. 

To give an operative method for finding the optimal thresholds, let us call $V_0, V_1, \ldots, V_n$ the vertices of the ROC convex hull, with $V_0 = (0,0)$ and $V_n = (1,1)$; moreover, let $s_i$ be the slope of the edge joining the vertices $V_{i-1}$ and $V_i$, and assume that $s_0 = \infty$ and that $s_{n+1} = 0$. If m is the slope of the level curve of interest, the list $\{s_i\}$ should be searched to find a value $s_k$ such that $s_k = m$ or $s_k > m > s_{k+1}$: in the first case, the level curve and the edge are coincident and thus either of the vertices $V_{k-1}$ and $V_k$ can be chosen. In the second case, the level curve touches the ROC convex hull in the vertex $V_k$, which provides the optimal threshold.

It is important to recall that $t_{1opt}$ must be less than $t_{2opt}$ to achieve the reject option. For this reason, the slopes $m_1$ and $m_2$ defined in eq. (17) must be such that $m_1 < m_2$, otherwise the reject option is not practicable.
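
The search just described can be sketched as follows. The vertex thresholds and edge slopes are those listed in tables 2 and 3, and the costs are those of combination e in table 4; the slopes of the level curves are obtained by solving eq. (15) and eq. (16) for TPR. The prior 268/768 (class counts of the whole dataset) and all function names are assumptions of this example.

```python
import math

# Thresholds of the hull vertices V_0..V_7 (table 2) and the slopes
# s_0..s_8 (table 3), where s_i belongs to the edge joining V_{i-1}, V_i.
VERTEX_T = [1.00, 0.90, 0.70, 0.30, 0.25, 0.20, 0.10, 0.00]
SLOPES = [math.inf, 13.45, 2.89, 1.92, 1.28, 1.10, 0.38, 0.09, 0.00]


def level_slopes(p_pos, c):
    """Slopes m1, m2 of the level curves (15) and (16) in the ROC plane."""
    p_neg = 1.0 - p_pos
    ctn, cfn = c["CTN"] - c["CR"], c["CFN"] - c["CR"]
    ctp, cfp = c["CTP"] - c["CR"], c["CFP"] - c["CR"]
    m1 = -(p_neg * ctn) / (p_pos * cfn)
    m2 = -(p_neg * cfp) / (p_pos * ctp)
    return m1, m2


def optimal_vertex(m, slopes):
    """Index k such that s_k >= m > s_{k+1}.

    When s_k == m the edge and the level curve coincide; this sketch
    returns V_k, although V_{k-1} would be an equally valid choice.
    """
    for k in range(len(slopes) - 1):
        if slopes[k] >= m > slopes[k + 1]:
            return k
    raise ValueError("the slope of the level curve must be positive")


def reject_thresholds(p_pos, c):
    """Return (t1_opt, t2_opt), or None when m1 < m2 does not hold."""
    m1, m2 = level_slopes(p_pos, c)
    if not m1 < m2:
        return None  # reject option not practicable
    return (VERTEX_T[optimal_vertex(m1, SLOPES)],
            VERTEX_T[optimal_vertex(m2, SLOPES)])


COSTS_E = {"CTP": 50.0, "CTN": 25.0, "CFN": -100.0, "CFP": -50.0, "CR": -12.5}
print(level_slopes(268 / 768, COSTS_E))
print(reject_thresholds(268 / 768, COSTS_E))
```

For combination e the sketch returns the thresholds 0.20 and 0.25, and the computed slopes are close to the values 0.799 and 1.119 listed in table 5.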



4 Experimental Results 

For testing the proposed reject rule, a medical dataset (the Pima Indians Diabetes dataset), publicly available from the UCI Machine Learning Repository, has been considered. This dataset involves the diagnosis of diabetes on the basis of the results of several tests. The data were collected by the National Institute of Diabetes and Digestive and Kidney Diseases. All of the patients were females at least 21 years old of Pima Indian heritage. The class variable has the values 0 (healthy) and 1 (diabetes). The dataset contains 768 labeled cases (500 healthy and 268 diabetes), each including 8 continuously valued inputs.

The classifier adopted is a Multi-Layer Perceptron with 8 input units, 4 hidden units and 1 output unit, implemented in the C language using the SPRANNLIB library. The network has been trained for 20,000 epochs using the back propagation algorithm with a learning rate of 0.01 and a momentum of 0.2. The training set contained 80% of the samples of the whole dataset. The remaining 20% were split into two different sets: the first one was used for evaluating the ROC curve, while the second one was adopted as the test set.





Fig. 4. The obtained ROC curve together with its convex hull. 

In fig. 4, the ROC curve obtained is shown together with its convex hull. The coor- 
dinates of the vertices with the respective threshold values and the slopes of the edges 
of the ROC convex hull are listed in tables 2 and 3. 



Table 2. The ROC convex hull vertices

Vertex    ROC convex hull vertices (FPR, TPR)    t
0         (0.00, 0.00)                           1.00
1         (0.02, 0.27)                           0.90
2         (0.10, 0.50)                           0.70
3         (0.18, 0.65)                           0.30
4         (0.24, 0.73)                           0.25
5         (0.38, 0.88)                           0.20
6         (0.58, 0.96)                           0.10
7         (1.00, 1.00)                           0.00



Table 3. The ROC convex hull slopes

Edge    ROC convex hull edge slopes
0       ∞
1       13.45
2       2.89
3       1.92
4       1.28
5       1.10
6       0.38
7       0.09
8       0.00



The costs considered for the experiments are shown in table 4: seven different cost 
combinations (denoted by a-g) have been chosen which reflect different situations. 
The reject cost has been assumed constant. 
Table 4. The combinations of costs

        CFN       CFP       CTN       CTP       CR
a       -50       -25       200       400       -12.5
b       -50       -25       100       200       -12.5
c       -50       -25        50       100       -12.5
d       -50       -25        25        50       -12.5
e      -100       -50        25        50       -12.5
f      -200      -100        25        50       -12.5
g      -400      -200        25        50       -12.5

Table 5. The slopes and the thresholds

        m1        m2        t1opt     t2opt
a      10.570    0.057      -         -
b       5.596    0.110      -         -
c       3.109    0.207      -         -
d       1.865    0.373      -         -
e       0.799    1.119     0.20      0.25
f       0.373    2.611     0.10      0.70
g       0.181    5.596     0.10      0.90



Table 5 shows the values for $m_1$ and $m_2$ evaluated for each cost combination. It is worth noting that the reject option is achievable only in the last three cases, where $m_1 < m_2$. The corresponding optimal thresholds, which can be deduced by looking at tables 2 and 3, are reported in the last two columns. As an example, figure 5 shows the optimal level curves of $U_1$ and $U_2$ for the cost combinations e and g.




Fig. 5. The optimal level curves of $U_1$ and $U_2$ for the cost combinations e (above) and g (below). The optimal points on the ROC convex hulls are also highlighted.
Table 6 summarizes the results obtained on the test set with and without the reject option. The first six columns contain the rates obtained: these values are constant in the rows a-d because the optimal point on the ROC for the utility function without reject is given by FPR = 0.38 and TPR = 0.88 for all the cost combinations.



Table 6. Results obtained on the test set

      FPR     TPR     FNR     TNR     RP      RN      U          Urej
a     0.38    0.88    0.62    0.12    -       -       196.079    -
b     0.38    0.88    0.62    0.12    -       -        93.944    -
c     0.38    0.88    0.62    0.12    -       -        42.876    -
d     0.38    0.88    0.62    0.12    -       -        17.343    -
e     0.27    0.70    0.15    0.55    ?       ?         9.151     16.125
f     0.12    0.45    0.06    0.39    0.49    0.42     -7.231     -4.000
g     0.04    0.24    0.06    0.39    0.70    0.57    -39.996    -26.125



In the rows e-g, where the reject option is possible, the rates are those given by $t_{1opt}$ and $t_{2opt}$. It is also possible to observe that the value of the utility obtained with the reject option (last column) is appreciably better than in the case of classification without the reject option (seventh column).
Abstract. The present work discusses what have been called 'imperfectly supervised situations': pattern recognition applications where the assumption of label correctness does not hold for all the elements of the training sample. A methodology for contending with these practical situations and avoiding their negative impact on the performance of supervised methods is presented. This methodology can be regarded as a cleaning process that removes some suspicious instances from the training sample or corrects the class labels of some others while retaining them. It has been conceived for classification with the Nearest Neighbor rule, a supervised nonparametric classifier that combines conceptual simplicity and an asymptotic error rate bounded in terms of the optimal Bayes error. However, initial experiments concerning the learning phase of a Multilayer Perceptron (not reported in the present work) seem to indicate a broader applicability. Results with both simulated and real data sets are presented to support the methodology and to clarify the ideas behind it. Related works are briefly reviewed and some issues deserving further research are also discussed.

Keywords: Supervised methods, Nearest neighbor classifier, learning, 
depuration methodology, generalized edition. 



1. Introduction 

Traditionally, pattern recognition methods have been sorted into two broad groups: 
supervised and unsupervised, according to the level of previous knowledge about the 
training sample identifications in the problem at hand. Much of the research work in 
the frame of supervised pattern recognition has been almost entirely devoted to the 
analysis of the characteristics of classification algorithms and to the study of feature 
selection methods. Recently, however, an increasing emphasis is being given to the 
evaluation of procedures used to collect and to clean the training sample, a critical 
aspect for effective automatization of discrimination tasks. 

* Work partially supported by Project Conacyt 32016-A and by Project Cosnet 744.99P 

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 621-630, 2000.
The design of supervised classifiers is based on the information supplied by the training sample (TS), a set of training patterns or prototypes representing all relevant classes and carrying correct class labels. In several practical applications, however, class identification of prototypes is difficult and very costly and, as a consequence, some imperfectly or incorrectly labeled prototypes may be present in the TS, leading to situations lying in between supervised and unsupervised methods, or, as they have been called, imperfectly supervised pattern recognition situations. Examples have been reported in medical diagnosis, the drawing of prognostic maps of mineral deposits and, particularly, in the interpretation of remotely sensed data. In this latter domain, training field selection and the production of suitable training statistics have been the concern of several researchers and practitioners, e.g. [1, 8, 10, 19, 29, 40]. Foody [17, 18] discusses to some extent the difficulties introduced into the classification process by those prototypes representing more than one class (e.g., those located on the border between classes) or being members of a class not considered when collecting the TS. Mather [32] refers to atypical elements in the TS that may belong to another class or may be hybrid or mixed elements.

The Nearest Neighbor (NN) rule [12] is a supervised nonparametric classifier, 
whose application does not require any assumption about probabilistic density 
functions. Some of its main features are described in the next Section. The 
performance of this classifier, as with any nonparametric method, is extremely 
sensitive to incorrectness or imperfections of the training sample. The present work 
introduces a methodology for decontaminating imperfect TSs while employing the NN 
rule for classification. This methodology can be regarded as a cleaning process 
removing some suspicious elements from the TS, or correcting the labels of some 
others and retaining them. Although conceived specifically for the NN rule, initial 
experiments with a Multilayer Perceptron seem to indicate a broader applicability. 
Results with both simulated and real data sets are presented to support the 
methodology and to clarify the ideas behind it. Related works are briefly reviewed and some issues deserving further consideration are also discussed.



2. The NN Rule 

This is a classifier that combines conceptual simplicity and an asymptotic error rate bounded in terms of the optimal Bayes error. Let $TS = \{(x_1,\varphi_1), (x_2,\varphi_2), \ldots, (x_n,\varphi_n)\}$ be the training sample. That is, TS is the set of n pairs of random samples $(x_i,\varphi_i)$ (i=1,2,...,n), where the label $\varphi_i$ may take values in {1,2,...,c} and designates the class of $x_i$ among the c possible classes. For classifying an unknown pattern X with the NN rule, it is necessary to determine first the nearest neighbor x' of X in the TS. That is, it is necessary to find x' in TS such that:

$$d(X,x') = \min_{x \in TS} d(X,x) \qquad (1)$$

where d( , ) means any suitable metric defined in the feature space. Then, the pattern X is assigned to the class identified by the label associated with x'. Devijver and Kittler [16] have expressed that "the basic idea behind the NN rule is that samples which fall close together in feature space are likely to belong to the same class". A more graphical description is due to [14]: "it is like to judge a person by the company he keeps".
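
A minimal sketch of the NN rule of eq. (1) is given below. The Euclidean distance is only one possible choice of the metric d( , ), and the function name and the toy training sample are assumptions made for illustration.

```python
import math


def nn_classify(x_new, training_sample):
    """Assign x_new the label of its nearest prototype in the TS.

    training_sample: list of (feature_vector, label) pairs.
    The metric d( , ) is taken here to be the Euclidean distance.
    """
    _, nearest_label = min(
        training_sample, key=lambda pair: math.dist(x_new, pair[0])
    )
    return nearest_label


# Toy training sample with two classes, purely illustrative.
TS = [((0.0, 0.0), 1), ((1.0, 1.0), 1), ((5.0, 5.0), 2)]
print(nn_classify((4.0, 4.5), TS))
print(nn_classify((0.2, 0.1), TS))
```

This brute-force search visits every prototype for each query, which is the computational burden that the data structures and TS-reduction techniques mentioned in the text aim to reduce.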

Two other peculiarities of the NN rule have contributed to its popularity: a) easy 
implementation, and b) known error rate bounds. The computational burden of this 
classifier, very high with brute-force searching methods, has been considerably cut 
down by developing suitable data structures and associated algorithms (e.g., [25]) or 
by reducing the TS size (e.g., [2, 26]). 

For improving the NN rule's performance, Wilson [43] proposed a procedure 
(Edition technique) to preprocess the TS. The algorithm has the following steps: 

1. For every x in TS, find the k (k=3 has been recommended) nearest neighbors of x among the other prototypes, and the class associated with the largest number of patterns among these k nearest neighbors. Ties are randomly broken whenever they occur.

2. Edit the TS by deleting those prototypes x whose identification label does not agree with the class associated with the largest number of the k nearest neighbors as determined in the foregoing.
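
The two steps above can be sketched as follows. The Euclidean metric, the function names and the toy sample are assumptions of this example; in addition, ties are resolved here deterministically (by the order in which labels are encountered) rather than randomly, a simplification that does not matter for two classes and odd k.

```python
import math
from collections import Counter


def knn_votes(index, ts, k):
    """Label counts among the k nearest neighbors of ts[index] in the rest of ts."""
    x = ts[index][0]
    others = [p for j, p in enumerate(ts) if j != index]
    others.sort(key=lambda p: math.dist(x, p[0]))
    return Counter(label for _, label in others[:k])


def wilson_edit(ts, k=3):
    """Keep only the prototypes whose label agrees with their k-NN majority."""
    kept = []
    for i, (x, label) in enumerate(ts):
        majority = knn_votes(i, ts, k).most_common(1)[0][0]
        if majority == label:
            kept.append((x, label))
    return kept


# Two clusters plus one prototype of the first cluster mislabeled as class 2.
TS = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1),
      ((5, 5), 2), ((5, 6), 2), ((6, 5), 2), ((6, 6), 2),
      ((0.5, 0.5), 2)]
print(wilson_edit(TS, k=3))
```

In this toy sample only the mislabeled prototype at (0.5, 0.5) is deleted; the other eight prototypes are kept.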

The benefits of the Edition technique have been supported by theoretical and empirical results (e.g., [3]). On the other hand, concerned with the possibility that a considerable number of prototypes might be removed from the TS, Koplowitz and Brown [31] developed a modification of this technique, the Generalized Edition (GE). Here, for a given value of k, another parameter k' is defined such that:

$$(k+1)/2 \le k' \le k \qquad (2)$$

For each prototype x in TS its k nearest neighbors are searched in the remainder of TS. If a particular class has at least k' representatives among these k nearest neighbors then x is labeled according to that class, independently of its original label. Otherwise, x is edited (removed). In short, the procedure looks for modifications of the training sample structure through changes of the labels (re-identification) of some training patterns and removal (edition) of some others.
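
A minimal sketch of Generalized Edition, under the same illustrative assumptions (Euclidean metric, hypothetical function name, toy sample), is the following. Because k' is at least (k+1)/2, at most one class can reach k' representatives among the k neighbors, so inspecting the most frequent label is sufficient.

```python
import math
from collections import Counter


def generalized_edit(ts, k=3, k_prime=2):
    """Relabel or remove each prototype according to its k nearest neighbors."""
    if not (k + 1) / 2 <= k_prime <= k:
        raise ValueError("k' must satisfy (k+1)/2 <= k' <= k")
    edited = []
    for i, (x, _) in enumerate(ts):
        others = sorted((p for j, p in enumerate(ts) if j != i),
                        key=lambda p: math.dist(x, p[0]))
        label, count = Counter(lab for _, lab in others[:k]).most_common(1)[0]
        if count >= k_prime:
            # Re-identification: x takes the label of the dominant class,
            # whatever its original label was.
            edited.append((x, label))
        # Otherwise x is edited, i.e. removed from the TS.
    return edited


# Two clusters plus one prototype of the first cluster mislabeled as class 2.
TS = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1),
      ((5, 5), 2), ((5, 6), 2), ((6, 5), 2), ((6, 6), 2),
      ((0.5, 0.5), 2)]
print(generalized_edit(TS, k=3, k_prime=2))
```

With k = 3 and k' = 2 all nine prototypes of the toy sample are retained and the mislabeled one is re-identified as class 1; with the stricter k' = 3 the four class-1 prototypes surrounding it are removed instead.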

Although neither of these two techniques was specifically aimed at dealing with contaminated training samples, both fundamentally modify the structure of the TS and, therefore, were included in the empirical evaluation described in Section 4.



3. Incorrections in the Training Sample and Related Works 

Traditional approaches to supervised pattern recognition imply the fulfillment of two basic assumptions concerning training samples in order to guarantee accurate identification of new patterns:

1. the set of c classes with representations in the training sample spans the entire pattern space

2. the training patterns used to teach the classifier how to discriminate each class are actually members of that class.

Practical experience has shown that in many real applications one or both of these assumptions do not entirely hold and that violation of these requirements strongly degrades classification accuracy. In accordance with this perception, the number of papers and proposals for handling this subject has significantly increased in recent years.

Outlier data is a concept that has been considered in Statistics for some time. It has 
been defined [41] as: a case that does not follow the same model as the rest of the 
data. Now the term has come to the fore also in the Pattern Recognition and Data 
Mining areas. Reports about the effect of these "noisy" patterns when included in the 
training sample and how to counteract it have been published in [20, 28, 34, 38] 
among others. Even for unsupervised methods outlier data have been the concern of 
several researchers, e.g., [22, 30]. 

However, this term is being employed to cover a broad range of circumstances, reflecting some confusion among dissimilar situations and the lack of a rigorous and unified concept of outlier data. In general, there are three of these potential situations:

1. Noisy or atypical data that can be produced by errors (measuring, capturing, etc.), an unfortunate property of many large databases.

2. New unidentified patterns appearing in the classification phase that do not belong to any of the classes represented in the TS (partially exposed environments: [13, 33]). These cases are usually handled by a reject option [15, 23].

3. Some authors employ the term outlier to denote mislabeled instances in the TS, which constitute the main focus of the present work. Brodley and Friedl [9] employ a combination of classifiers to filter the training patterns, aiming to identify and eliminate wrongly labeled training cases prior to applying the chosen learning algorithm. The basic differences from the procedure presented in the next section are:

i) they do not consider correcting the labels of some contaminated training data;

ii) although they use real data for demonstration purposes, they intentionally modified the labels of some training patterns to simulate a situation that they state is very frequent in these applications (remotely sensed data);

iii) the filtering method they propose cannot fully overcome the error in the data for noise levels of 30% or greater.



4. Depuration of Training Samples 

The NN rule, like any other nonparametric pattern recognition method, suffers from 
an extreme sensitivity to contaminated training sets [36]. Hence the 
importance of building procedures to contend with imperfectly supervised 
environments. Barandela [2] reports a considerable number of Monte Carlo 
experiments to assess some procedures usable for decontaminating training samples. 
The procedures included were: a) Edition, b) Generalized Edition, c) Mutual 
Neighborhood [21], d) Reidentification (after some ideas of Chitinenni [11]) and e) All 
k-NN, a variant of the Edition technique proposed in [37]. 
For comparing these procedures, experiments with simulated patterns from two 
Gaussian populations with different mean vectors and equal covariance matrix were 
carried out. In every experiment, the TS consisted of 200 prototypes and the 
independent test sample (used for validation purposes) contained 500 elements, always 
half from each class. Five different levels (percentages) of contamination were 
considered, in which members of the second class were wrongly labeled as belonging to 
class 1. For each of these cases and combinations, 30 replications were done. The 
averaged results, misclassification percentages on the test set, are shown in Table 1. 
The effect of Generalized Edition on the contaminated training samples was remarkable 
and its superiority over the rest of the evaluated methods was recorded in more than 
98% of the individual replications. 



Table 1. Comparison of methods for decontamination (averaged misclassification rates, %). 

Wrong labels in class 1      5%     15%     25%     35%     45% 
Original TS                 15.0    17.8    20.4    22.9    25.4 
Generalized Edition         10.2    10.4    11.5    12.4    14.3 



It should be noted that, to decide on the labeling of a prototype (and 
perhaps its transfer to another class, i.e., relabeling) or on its edition (elimination 
from the TS), GE takes into account the labels of the k nearest neighbors of this 
prototype. That is, in evaluating the correctness of a training pattern's label, the 
procedure relies on the labels of other prototypes which, in turn, may be incorrectly 
identified. From the results in Table 1 it is clear that the greater the percentage of 
initially mislabeled prototypes, the greater the percentage of test patterns 
erroneously classified after applying GE. Nevertheless, the method 
produces an unquestionable improvement in the classifier's performance, indicating 
that the resulting TS contains appreciably fewer wrong labels. This suggested that 
repeating the procedure is worthwhile, because in every further application the 
environment will be less contaminated [2]. The idea was put into practice with the 
same simulated data sets, yielding the results shown in Table 2. 
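
As a rough illustration of a single Generalized Edition pass, the sketch below follows the common (k, k') formulation of generalized editing: a prototype takes the label held by at least k' of its k nearest neighbors, and is removed when no class reaches that count. The values k = 3 and k' = 2, the Euclidean metric and the function name `generalized_edition` are assumptions made for this example; they are not stated in the text.

```python
import numpy as np

def generalized_edition(X, y, k=3, k_prime=2):
    """One pass of a (k, k')-style generalized editing rule (assumed form).

    For each prototype, look at the labels of its k nearest neighbors
    (the prototype itself excluded). If some class holds at least k_prime
    of them, the prototype takes that label (possibly a relabeling);
    otherwise it is removed from the training sample.
    Returns the edited X, the edited labels and the boolean keep mask.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    # Pairwise squared Euclidean distances; a prototype is not its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    keep = np.zeros(n, dtype=bool)
    new_y = y.copy()
    for i in range(n):
        nn = np.argsort(d2[i])[:k]                  # indices of the k nearest neighbors
        labels, counts = np.unique(y[nn], return_counts=True)
        j = counts.argmax()
        if counts[j] >= k_prime:                    # enough agreement: keep, maybe relabel
            keep[i] = True
            new_y[i] = labels[j]
    return X[keep], new_y[keep], keep
```

All decisions in a pass are taken from the labels as they stood before the pass, so the result does not depend on the order in which prototypes are visited. Choosing k' greater than k/2 avoids ties between classes.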

It was found unnecessary to carry out more than three successive applications. 
Already after the second application, and more clearly after the third one, 
misclassification rates tend to stabilize, regardless of the original amount of wrong 
labels (which confirms how proper this proposition is). In the third application 
very few removals and no transfers at all are produced. These results and those 
of the practical applications below suggest that the procedure stops after a 
finite number of iterations. 
Table 2. Reiteration of Generalized Edition - averaged error rates. 



Wrong labels       Original          Generalized Edition 
in Class No. 1     Tr. Sample    1st app.   2nd app.   3rd app. 
 5%                  15.0          10.2       10.1       10.1 
15%                  17.8          10.4       10.1       10.1 
25%                  20.4          11.5       10.2       10.2 
35%                  22.9          12.4       10.5       10.3 
45%                  25.4          14.3       10.8       10.6 



Practical experiences with several real data sets in the geophysical domain [5] led 
to a novel methodology for preprocessing and depurating the TS. The Depuration 
methodology involves several applications of Generalized Edition until stability has 
been reached in the structure of the TS and in the estimated error rate (leave-one-out), 
followed by the application of Wilson's Edition, possibly also repeated. It has been 
observed that after the first or second application of Generalized Edition, the 
elimination of prototypes stops and the number of transfers (relabelings) decreases 
gradually. From the second application onwards these transfers only affect a part of 
those prototypes whose labels had been changed in the first iteration, producing an 
oscillatory movement with some prototypes being passed from one class to another 
and back. At each iteration, this "pendulum" effect involves an ever 
decreasing number of prototypes until a steady situation with no more movements is 
reached. At this point Edition eliminates only those prototypes (or a 
part of them) that had remained oscillating until the last iterations. These and other 
characteristics of the Depuration methodology will become clearer after a brief 
outline of some of the practical applications mentioned. 
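
The overall Depuration loop can be sketched as follows. This is only an illustrative outline under stated assumptions: the editing rules use a (k, k') majority criterion with k = 3 and k' = 2, stability is judged solely by an unchanged training sample (the leave-one-out error estimate mentioned above is not monitored), and `max_iter` is a hypothetical safeguard against the "pendulum" effect. All function names are invented for this example.

```python
import numpy as np

def _knn_labels(X, y, k):
    """Labels of the k nearest neighbors of every prototype (itself excluded)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    return y[np.argsort(d2, axis=1)[:, :k]]

def _majority(row):
    """Most frequent label in row and how often it occurs."""
    labels, counts = np.unique(row, return_counts=True)
    j = counts.argmax()
    return labels[j], counts[j]

def ge_pass(X, y, k=3, k_prime=2):
    """Generalized Edition pass: relabel to the dominant neighbor class, or drop."""
    votes = [_majority(r) for r in _knn_labels(X, y, k)]
    keep = np.array([c >= k_prime for _, c in votes])
    new_y = np.array([lab for lab, _ in votes])
    return X[keep], new_y[keep]

def wilson_pass(X, y, k=3):
    """Wilson's Edition pass: drop prototypes misclassified by their k neighbors."""
    votes = [_majority(r) for r in _knn_labels(X, y, k)]
    keep = np.array([lab == yi for (lab, _), yi in zip(votes, y)])
    return X[keep], y[keep]

def depurate(X, y, k=3, k_prime=2, max_iter=20):
    """Repeat GE until the TS is unchanged, then repeat Wilson's Edition likewise."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    steps = (lambda a, b: ge_pass(a, b, k, k_prime),
             lambda a, b: wilson_pass(a, b, k))
    for step in steps:
        for _ in range(max_iter):
            X_new, y_new = step(X, y)
            # Same size and same labels means no removal and no transfer happened.
            stable = len(y_new) == len(y) and np.array_equal(y_new, y)
            X, y = X_new, y_new
            if stable:
                break
    return X, y
```

The sketch assumes the training sample always keeps more than k prototypes; a production version would need to guard against very small samples.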

Application No. 1 [4] Six features had been measured on a set of 268 unlabeled 
patterns (strata from well log data). A clustering method sorted the data set 
into 4 groups that, since the ultimate purpose was to explore possibilities of gas-oil 
manifestations, have been regarded as: 

- classes 1 and 3, formed by nonprospective strata 

- classes 2 and 4, presenting strong association with the prospective strata of the area 

This configuration was taken as the initial TS to be used afterwards for classifying 
additional strata, building up a semisupervised classification system ([24], [27]). 
Obviously, this TS should be regarded as imperfectly supervised since, by the very 
nature of clustering algorithms, some prototypes have received unreliable labels. Hence 
the need to employ the Depuration process to improve its structure. Generalized 
Edition was applied twice and the second application produced no removals and only 
a few transfers (about 4%). On the other hand, the error estimate decreased by more 
than 75%. Edition also achieved a remarkable improvement in performance, yet with a 
small size reduction. As a final task, the processed TS was used to classify, with the 
NN rule, another 628 strata coming from three different well logs. The resulting 
assignments were verified using ancillary information and judged more accurate 
than those obtained when employing the whole original TS. 

Application No. 2 [35] This work aimed to characterize the stratigraphic 
sequences present in an oil deposit, on the basis of an evaluation of several 
petrophysical and geophysical parameters, adding up to 12 features. A TS with 139 
prototypes belonging to six classes was available. Peculiarities of the area under study 
gave rise to vagueness and ambiguity in the identification of the prototypes, while 
the skill and intuition of the interpreter played a decisive role in the decisions. 
Consequently, the TS at hand was treated as imperfectly supervised, requiring a 
Depuration process. Most of the resulting transfers affected prototypes located on 
the borders between classes, shifting them up or down. As already explained, the 
accuracy of these border definitions had been in doubt. Geophysicists in charge of the 
study of the area accepted the depurated TS as more consistent and more convenient for 
modeling purposes than the original one. 

Application No. 3 [6] The data employed in this case came from a previous work [39], 
aimed at studying the processes conditioning the shape of an ophiolitic sequence. The 
collected sample consisted of 187 prototypes sorted into four classes (according to the 
lithology). Seven features were recorded for each pattern. Geophysicists well 
acquainted with the data and with the procedures employed to collect and prepare 
them regarded this TS as perfectly supervised and pronounced themselves strongly 
against any modification. When they agreed to collaborate in the evaluation of the 
Depuration process (see Table 3) merely for exploratory purposes, they 
demanded severe restrictions on the amount and type of transfers to be allowed. 
Nevertheless, after the process was applied, these same domain experts accepted the 
depurated TS without hesitation as more accurate and better structured than the 
original one. Here again, an important part of the transfers involved prototypes 
located on the borders between different lithologies (classes). Moreover, undesirable 
transfers were either not recorded at all or occurred only at a very low level. 



Table 3. Practical geophysical applications 

                           Applic. 1                Applic. 2                Applic. 3 
Procedure            No. Patt.  Err est. (%)  No. Patt.  Err est. (%)  No. Patt.  Err est. (%) 
Original TS             268        34.0          139        42.4          187        52.9 
GE (repeated)           130         1.5          136         2.9          134         6.7 
Edition (repeated)      122         0.0          117         0.0          105         0.0 



Concerning these applications, some remarks can be made. Firstly, Depuration 
has shown its benefits in the three possible environments: unsupervised (as a 
complement to the clustering algorithm), imperfectly supervised, and supervised. In 
the last case, the success of Depuration could be explained by the lack of a precise 
and well defined distinction (physical and conceptual) among classes, a rather common 
situation in practice, at least in the geoscience fields. As a byproduct, the Depuration 
process yields a significant reduction of the TS size and, accordingly, of the 
computational time required for subsequent work involving these data. With the 
exception of Application No. 2, where the unfavorable ratio between dimensionality 
and TS size compelled the elimination of superfluous features at the very beginning of 
the process, feature selection was easier when implemented as an intermediate step. 
Using the estimate L of the error probability as a guide for conducting the process 
proved suitable. 



5. Discussion 

Wilkinson et al. [42] mention adequacy of the training data as one of the factors 
dictating the performance of any classifier and state that it is "very often outside 
the control of the data analyst". The procedure presented here shows that an 
important contribution can be made to amending some deficiencies of the available 
TS and to increasing its usefulness. The Depuration process has revealed significant 
benefits for both the simulated and the real data applications. Although the real 
examples explained above concern the classification of geophysical data, it should 
be clear that the procedure is independent of the origin of the data set in question and 
can be used in any application involving supervised classification. Indeed, an 
application with remotely sensed data has already been reported [7]. 

The importance of this decontamination issue deserves further research. Efforts should 
concentrate on the feasibility and convenience of developing methodologies 
suited to the different possible causes of contamination in the TS. 
The Depuration procedure handles quite well situations such as those already highlighted 
in the applications above: misidentification of some prototypes due to the difficulty 
and high cost of collecting the data, or to some characteristics of the application 
domain that induce vagueness and a lack of clear separation among classes. Inclusion 
in the TS of prototypes belonging to classes not considered (untrained classes) and of 
mixed prototypes (representing more than one class) would require additions to or 
modifications of the procedure. Implementation of a classifier with a reject option and 
a fuzzy approach (as in [19]) could be useful in this respect. What seems evident is 
that methodologies like the one presented here lead to computer systems for 
classification tasks that are not only faster than human interpreters but also able 
to yield more accurate classification results and, at the same time, to provide a 
better model for the phenomenon under study. The latter advantage is obtained through 
the relabeling of some of the training patterns. 
Abstract. To improve weak classifiers, bagging and boosting can be used. 
These techniques are based on combining classifiers. Usually, a simple majority 
vote or a weighted majority vote is used as the combining rule in bagging and 
boosting. However, other combining rules such as mean, product and average 
are possible. In this paper, we study bagging and boosting in Linear 
Discriminant Analysis (LDA) and the role of combining rules in bagging and 
boosting. Simulation studies, carried out for two artificial data sets and one real 
data set, show that bagging and boosting might be useful in LDA: bagging for 
critical training sample sizes and boosting for large training sample sizes. In 
contrast to a common opinion, we demonstrate that the usefulness of boosting 
does not directly depend on the instability of a classifier. It is also shown that the 
choice of the combining rule may affect the performance of bagging and 
boosting. 

1 Introduction 

When the training sample size is small compared to the data dimensionality, the 
training data may often give a distorted representation of the real data distribution. A 
classification rule constructed on such training data may be biased and have a large 
variance. Consequently, one can get a weak classifier with poor performance [1]. 
In order to improve a weak classifier by stabilizing its decision, a number of 
techniques could be used, for instance, regularization [2] or noise injection [3]. 
Another approach, which allows us to improve a weak classifier, consists in combining 
classifiers obtained on modified versions of the original training set (e.g., by 
sampling [4] or weighting). This approach is implemented in bagging [5] and boosting 
[6], although in different ways. In bagging, one samples the training set, generating 
random independent bootstrap replicates of the training set, constructs the classifier on 
each of these bootstrap replicates and aggregates them by simple majority vote in the 
final decision rule. In boosting, classifiers are constructed on weighted versions of the 
training set, which are obtained sequentially in the algorithm. Initially, all objects have 
equal weights, and the first classifier is constructed on this data set. Then the weights 
are changed according to the performance of the classifier. Erroneously classified 
objects get larger weights and the next classifier is boosted on the reweighted training 
set. In this way a sequence of training sets and classifiers is obtained, which are then 
combined by the simple majority or the weighted majority vote in the final decision. 

As a rule, bagging and boosting are applied to classification and regression 
trees (CART) [7], [8], [9], [10], where it is difficult to apply combining rules other than 
simple majority vote or weighted majority vote. However, bagging and boosting may 
also perform well in linear discriminant analysis [11], [12]. Linear classifiers allow us 
to use other combining rules such as the average (when the final classifier is obtained 
by averaging the coefficients of the combined classifiers), the mean (when the decision 
is made according to the mean of the posterior probabilities given by the combined 
classifiers) and the product rule (when the decision is made by the product of the 
posterior probabilities given by the combined classifiers). It may happen that these 
combining rules perform better than majority vote, especially for bagging. Moreover, 
the average rule has an advantage over the other combining rules, because it requires 
keeping only the coefficients of the classifiers instead of all posterior probabilities of 
each combined classifier. 
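
To make the five rules concrete, here is a minimal two-class sketch. It assumes each combined classifier is a linear discriminant g_b(x) = w_b · x + w0_b that votes for class 1 when g_b(x) > 0, and it models the posterior probability of class 1 as a sigmoid of g_b(x); that posterior model, the function name `combine` and its arguments are assumptions of this example and are not taken from the paper.

```python
import numpy as np

def combine(W, w0, X, rule="average", c=None):
    """Combine B two-class linear classifiers g_b(x) = W[b] @ x + w0[b].

    Returns labels in {0, 1}. Posteriors are modeled as sigmoid(g_b(x)),
    which is an assumption made for this sketch.
    """
    W, w0, X = np.asarray(W, float), np.asarray(w0, float), np.asarray(X, float)
    G = X @ W.T + w0                      # discriminant values, shape (n_objects, B)
    if rule == "average":                 # average the coefficients, then classify once
        return (X @ W.mean(axis=0) + w0.mean() > 0).astype(int)
    if rule == "majority":                # one vote per classifier
        return ((G > 0).mean(axis=1) > 0.5).astype(int)
    if rule == "weighted":                # votes weighted by c_b
        c = np.ones(len(w0)) if c is None else np.asarray(c, float)
        return (np.sign(G) @ c > 0).astype(int)
    P1 = 1.0 / (1.0 + np.exp(-G))         # assumed posterior of class 1
    if rule == "mean":                    # compare mean posteriors of the two classes
        return (P1.mean(axis=1) > 0.5).astype(int)
    if rule == "product":                 # compare products of posteriors (in logs)
        return (np.log(P1).sum(axis=1) > np.log(1 - P1).sum(axis=1)).astype(int)
    raise ValueError(f"unknown rule: {rule}")
```

Only the average rule can discard the individual classifiers after training, since it needs just the averaged coefficients. Note that with this particular sigmoid model the product rule reduces to the sign of the summed discriminants and therefore coincides with the average rule; with other posterior estimates the two rules generally differ.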

In this paper we investigate the role of the five combining rules mentioned (simple 
majority vote, weighted majority vote, average, mean and product) for bagging and 
boosting in linear discriminant analysis. The Nearest Mean Classifier (NMC) [13], also 
known as the Euclidean distance classifier, is used in our study. This choice is made 
because the NMC is often a weak, unstable classifier for data sets having a 
distribution other than Gaussian with equal variances. Therefore, bagging and boosting, 
which we review in the next section, may be useful in order to improve its 
performance. To perform our simulation study, we have chosen two artificial data sets 
and one real data set, which are described in section 3. The artificial data sets present a 
2-class problem. The real data set consists of a 2-class problem and a 4-class problem. 
The results of our simulation study are presented in section 4. Conclusions are 
summarized in section 5. 

2 Bagging and Boosting 

In order to improve the performance of unstable regression or classification 
rules, a number of combining techniques can be used. In recent years, bagging and 
boosting have become the most popular ones. They both modify the training data set, 
build classifiers on these modified training sets and then combine them into a final 
decision rule by simple or weighted majority vote. However, they do so in different ways. 

Bagging is based on the bootstrapping [4] and aggregating concepts and was 
presented by Breiman [5]. Both bootstrapping and aggregating may be beneficial. 
Bootstrapping is based on random sampling with replacement. Therefore, taking a 
bootstrap replicate $X^b = (X_1^b, X_2^b, \ldots, X_N^b)$ (the random selection with 
replacement of N objects from the set of N objects) of the training sample set 
$X = (X_1, X_2, \ldots, X_N)$, one can sometimes avoid, or get fewer, misleading training 
objects in the bootstrap training set. Consequently, a classifier constructed on such a 
training set may have a better performance. Aggregating actually means combining 
classifiers. Often a combined classifier gives better results than individual classifiers, 
because it combines the advantages of the individual classifiers in the final solution. 
Therefore, bagging might be helpful for building a better classifier on training sample 
sets with misleaders. In bagging, bootstrapping and aggregating techniques are 
implemented in the following way. 

1. Repeat for b = 1, 2, ..., B. 

a) Take a bootstrap replicate $X^b$ of the training data set $X$. 

b) Construct a classifier $C(X^b)$ on $X^b$. 

2. Combine classifiers by simple majority vote to a final decision rule. 
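
The two steps above can be sketched for the two-class NMC as follows. This is a minimal illustration, not the authors' code: class labels are assumed to be 0 and 1, a replicate that happens to contain only one class is simply redrawn (an assumption, since the text does not say how such replicates are handled), and the function names are invented for this example.

```python
import numpy as np

def fit_nmc(X, y):
    """Nearest mean classifier for classes 0/1, written as a linear rule (w, w0).

    An object x is assigned to class 1 when x @ w + w0 > 0, which is
    equivalent to x being closer (Euclidean) to the class-1 mean.
    """
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w = m1 - m0
    w0 = -0.5 * (m1 + m0) @ w
    return w, w0

def bagged_nmc(X, y, B=25, seed=0):
    """Step 1: build B classifiers, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    while len(models) < B:
        idx = rng.integers(0, n, size=n)      # N draws with replacement
        if len(np.unique(y[idx])) < 2:        # replicate lost a class: redraw
            continue
        models.append(fit_nmc(X[idx], y[idx]))
    return models

def predict_majority(models, X):
    """Step 2: combine the B classifiers by simple majority vote."""
    votes = np.array([X @ w + w0 > 0 for w, w0 in models])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

Using an odd B avoids tied votes in the two-class case.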

Boosting, proposed by Freund and Schapire [6], is another technique to 
combine unstable and weak classifiers in order to get a classification rule with a better 
performance. In contrast to bagging, where bootstrap training sets and classifiers are 
independent and random, in boosting, classifiers and training sets are obtained in a 
strictly deterministic way. Both training data sets and classifiers are obtained 
sequentially in the algorithm. At each step, training data are reweighted in such a way 
that incorrectly classified objects get larger weights in the new modified training set. 
By doing so, one actually maximizes margins between training objects. This suggests a 
connection between boosting and Vapnik’s Support Vector Classifier (SVC) [7], [14]. 
Boosting is organized in the following way. 

1. Repeat for b = 1, 2, ..., B. 

a) Construct the classifier $C^b(X^*)$ on the weighted data set 
$X^* = (w_1^b X_1, w_2^b X_2, \ldots, w_N^b X_N)$. 

b) [...] 

c) Set $w_i^{b+1} = w_i^b \exp(-c_b \ldots)$ and renormalize so that 
$\sum_{i=1}^{N} w_i^{b+1} = \ldots$ 

2. Combine classifiers $C^b(X^*)$ by weighted majority vote with weights $c_b$ to a final 
decision rule. 
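
For orientation, the sketch below shows one way such a scheme can be realized for the two-class NMC. It uses the standard discrete AdaBoost update (weighted error, classifier weight c_b = 0.5 ln((1 - err_b) / err_b), exponential reweighting of misclassified objects); this is an assumption and may differ in detail from the formulas of the original algorithm. The weighted NMC via weighted class means, the stopping rule and all function names are likewise choices made for this example.

```python
import numpy as np

def fit_weighted_nmc(X, y, w):
    """NMC on a weighted training set: class means are weighted averages."""
    m0 = np.average(X[y == 0], axis=0, weights=w[y == 0])
    m1 = np.average(X[y == 1], axis=0, weights=w[y == 1])
    v = m1 - m0
    return v, -0.5 * (m1 + m0) @ v

def boosted_nmc(X, y, B=50):
    """Sequentially build up to B classifiers on reweighted training sets."""
    n = len(y)
    w = np.ones(n)                          # initially all objects weigh the same
    models, c = [], []
    for _ in range(B):
        v, v0 = fit_weighted_nmc(X, y, w)
        miss = (X @ v + v0 > 0).astype(int) != y
        err = w[miss].sum() / w.sum()       # weighted training error
        if err == 0 or err >= 0.5:          # standard AdaBoost cannot proceed
            if not models:                  # keep at least one classifier
                models.append((v, v0))
                c.append(1.0)
            break
        cb = 0.5 * np.log((1 - err) / err)  # weight of this classifier in the vote
        models.append((v, v0))
        c.append(cb)
        w = w * np.exp(cb * miss)           # misclassified objects gain weight
        w = w * n / w.sum()                 # renormalize: weights sum to n
    return models, np.array(c)

def predict_weighted_vote(models, c, X):
    """Combine the classifiers by weighted majority vote with weights c."""
    votes = np.array([np.where(X @ v + v0 > 0, 1.0, -1.0) for v, v0 in models])
    return (c @ votes > 0).astype(int)
```

Passing all-ones weights to `predict_weighted_vote` gives the simple majority vote over the same sequence of classifiers, which makes it easy to compare the two voting rules on identical models.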



3 Data 



Two artificial data sets and one real data set are used for our experimental 
investigations. 

• The first set is a 30-dimensional correlated Gaussian data set constituted by 
two classes with equal covariance matrices. Each class consists of 500 vectors. The 
mean of the first class is zero for all features. The mean of the second class is equal to 
3 for the first two features and equal to 0 for all other features. The common covariance 
matrix is a diagonal matrix with a variance of 40 for the second feature and a unit 
variance for all other features. The intrinsic class overlap (Bayes error) is 0.064. In 
order to spread the separability over all features, this data set is rotated using a 
$30 \times 30$ rotation matrix which is $\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$ 
for the first two features and the identity matrix for all other features. We further 
call these data “Gaussian correlated data”. Their first two features are presented in 
Fig. 1. 
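
A possible way to generate data of this kind is sketched below, together with a quick check of the stated Bayes error. The function names, the random seed and the convention of applying the rotation matrix as x' = R x are assumptions of this example; the 2 x 2 block is used exactly as printed, without normalization.

```python
import numpy as np
from math import erfc, sqrt

def gaussian_correlated_data(n_per_class=500, dim=30, seed=0):
    """Two Gaussian classes with a common diagonal covariance, then rotated."""
    rng = np.random.default_rng(seed)
    std = np.ones(dim)
    std[1] = sqrt(40.0)                       # variance 40 for the second feature
    mean2 = np.zeros(dim)
    mean2[:2] = 3.0                           # class 2 mean: 3 on the first two features
    X1 = rng.standard_normal((n_per_class, dim)) * std
    X2 = rng.standard_normal((n_per_class, dim)) * std + mean2
    R = np.eye(dim)
    R[:2, :2] = [[1.0, -1.0], [1.0, 1.0]]     # matrix as printed (not normalized)
    X = np.vstack([X1, X2]) @ R.T             # apply x' = R x to every row
    y = np.repeat([0, 1], n_per_class)
    return X, y

def bayes_error():
    """Bayes error for equal priors: Phi(-delta / 2), delta = Mahalanobis distance."""
    delta = sqrt(3.0 ** 2 / 1.0 + 3.0 ** 2 / 40.0)
    return 0.5 * erfc(delta / (2.0 * sqrt(2.0)))
```

Because the transformation is invertible, it does not change the Bayes error, so the value can be computed from the unrotated parameters; `bayes_error()` evaluates to about 0.064, in agreement with the figure quoted above.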

Fig. 1. The scatter plot of a two-dimensional projection of the 30-dimensional Gaussian 
correlated data 

• The second data set consists of two 8-dimensional classes. The first two features 
of the data classes are uniformly distributed with unit variance spherical Gaussian 
noise along two $2\pi/3$ concentric arcs with radii 6.2 and 10.0 for the first and the 
second class respectively: 

$$X^{(1)} = \begin{pmatrix} \rho_1 \cos(\gamma_1) + \xi_1 \\ \rho_1 \sin(\gamma_1) + \xi_2 \end{pmatrix}, \qquad X^{(2)} = \begin{pmatrix} \rho_2 \cos(\gamma_2) + \xi_3 \\ \rho_2 \sin(\gamma_2) + \xi_4 \end{pmatrix},$$ 

where $\rho_1 = 6.2$, $\rho_2 = 10$, $\xi_i \sim N(0, 1)$, and $\gamma_k$ is uniformly 
distributed over an arc of $2\pi/3$, $k = 1, 2$. The other 
six features have the same spherical Gaussian distribution with zero mean and variance 
0.1 for both classes. Both classes consist of 500 objects each. We will call these data 
“banana-shaped data” (BSD). Their first two features are presented in Fig. 2. 

Fig. 2. Scatter plot of a two-dimensional projection of the 8-dimensional banana-shaped data 

Fig. 3. Scatter plot of the first two principal components of the pump data with four classes 



• The last data set consists of measurements obtained from four kinds of 
water-pump operating states: normal behaviour (NB), a bearing fault (BF) (a fault in 
the outer ring of the uppermost single bearing), an imbalance failure (IF) and a loose 
foundation failure (LFF), where the running speed (46, 48, 50, 52 and 54 Hz) and 
machine load (25, 29 and 33 kW) are varied. To measure pump vibrations, a ring 
accelerometer is used. For the obtained time series of the pump vibration patterns, we 
determined the coefficients of an order 128 autoregressive model. The 128 coefficients 
of this model are used as the features describing the pump vibration patterns. For each 
operating mode, 15 128-dimensional vectors are obtained, which are normalized w.r.t. 
the mean and the standard deviation. Then the data are combined either in 4 classes 
(normal behaviour, bearing fault, imbalance failure and loose foundation failure) 
consisting of 225 observations each, or in 2 classes (the normal behaviour and the 
abnormal behaviour). In the latter case, the normal behaviour class consists of 225 
observations and the abnormal behaviour class consists of 675 observations representing 
all three failures: bearing, imbalance and loose foundation. The first two principal 
components of the autoregressive model coefficients for the four operating states are 
presented in Fig. 3. We call these data “pump data” in the experiments. 

Training data sets with 3 to 400 (with 3 to 200 for pump data) samples per class 
are chosen randomly from the total set. The remaining data are used for testing. These 
and all other experiments are repeated 50 times for independent training sample sets. 
All figures present results averaged over these 50 repetitions; this is not mentioned 
again below. 

The standard deviations of the mean generalization errors of the NMC, the 
bagged NMC, the boosted NMC and the SVC were of similar order for each data 
set. With increasing training sample size, they decreased approximately 
from 0.015 to 0.005, from 0.015 to 0.009, from 0.01 to 0.008 and from 0.02 to 0.01 for 
30-dimensional Gaussian correlated data, for 8-dimensional banana-shaped data and 
for 128-dimensional pump data with 4 and 2 classes, respectively. 
4 The Effect of the Combining Rule in Bagging and Boosting 

Let us now study the usefulness of bagging and boosting in LDA using the NMC 
as an example, and look at the effect of the combining rule on their performance. 
In order to better understand when bagging and boosting might be beneficial, it may 
be useful to consider the instability of a classifier [11]. We measure the instability of 
a classifier by calculating the changes in the classification of a training data set 
caused by replacing the original learning data set with a bootstrap replicate of it. 
Repeating this procedure several times on the training set (we did it 25 times) and 
averaging the results, an estimate of the classifier instability is obtained. The mean 
instability of the NMC (on 50 independent training sets) defined in this way is 
presented in Fig. 4 for the data sets described in the previous section. One can see that 
the instability of the NMC differs between data sets. However, for all data sets, the 
classifier is the most unstable when the training sample size is small. The instability 
of the classifier then decreases as the training sample size increases. 
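
One reading of this instability measure is sketched below for the two-class NMC: the classifier trained on the full training set is compared with classifiers trained on bootstrap replicates, and the fraction of training objects whose assigned label changes is averaged over the replicates. The function names, the 0/1 label coding and the redrawing of replicates that miss a class are assumptions of this example.

```python
import numpy as np

def fit_nmc(X, y):
    """Nearest mean classifier for classes 0/1 as a linear rule (w, w0)."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    w = m1 - m0
    return w, -0.5 * (m1 + m0) @ w

def predict_nmc(model, X):
    w, w0 = model
    return (X @ w + w0 > 0).astype(int)

def nmc_instability(X, y, n_boot=25, seed=0):
    """Average fraction of training objects classified differently when the
    NMC is trained on a bootstrap replicate instead of the full training set."""
    rng = np.random.default_rng(seed)
    base = predict_nmc(fit_nmc(X, y), X)      # labels from the original classifier
    n = len(y)
    changes = []
    while len(changes) < n_boot:
        idx = rng.integers(0, n, size=n)      # bootstrap replicate
        if len(np.unique(y[idx])) < 2:        # replicate lost a class: redraw
            continue
        boot = predict_nmc(fit_nmc(X[idx], y[idx]), X)
        changes.append(np.mean(boot != base))
    return float(np.mean(changes))
```

A value of 0 means that no bootstrap replicate changed any assignment; larger values indicate a classifier that reacts strongly to changes in the composition of the training set.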





Fig. 4. The instability of the NMC for 30-dimensional Gaussian correlated data (a), 
8-dimensional banana-shaped data (b), 128-dimensional pump data with 2 classes (c) and 4 
classes (d) 
Simulation results obtained on the 30-dimensional Gaussian correlated data 
(see Fig. 5) show that bagging and boosting are very useful for the NMC on this data 
set. Bagging almost halves the generalization error of the NMC for critical 
training sample sizes, when the data dimensionality is comparable with the number of 
training objects and the classifier is unstable. When training sets are very small, they 
often represent the distribution of the entire data set incorrectly. By bootstrapping 
such training sets, one can hardly get a better training set. Therefore, bagging is 
useless for very small training sample sizes. When the training sample size is large, 
the classifier is stable. Large training sets represent the distribution of the entire 
data accurately. Therefore, perturbations in the composition of the training set do not 
change the training set very much. For this reason, bagging is useless for large 
training sample sizes. One can also see that the performance of bagging is strongly 
affected by the choice of the combining rule. Bagging with the simple majority vote 
rule, which is usually used as the combining rule in bagging, performs the worst. For 
this data set, the best combining rules for bagging are the average, the weighted 
majority vote and the product. Comparing the left plot for bagging and the right plot 
for boosting in Fig. 5, one can clearly see that boosting outperforms bagging for each 
respective combining rule. In boosting, wrongly classified objects get larger weights. 
Mainly, they are objects on the border between classes. Therefore, boosting performs 
best for large training sample sizes, when the border between classes becomes more 
informative. In this case, boosting the NMC performs similarly to the linear SVC [14]. 
However, when the training sample size is large, the NMC is stable. This suggests 
that, in contrast to bagging, the usefulness of boosting does not depend directly on 
the stability of the classifier. It depends on the “quality” of the wrongly classified 
objects (usually, the border between data classes) and on the ability of the classifier 
(its complexity) to distinguish them correctly. As concerns combining rules, we see that 
the choice of the combining rule is less important for boosting than for bagging. When 
boosting the NMC on the 30-dimensional Gaussian correlated data, all combining 
rules perform similarly to each other with the exception of simple majority vote, which 
is noticeably worse. 

Fig. 5. The generalization error of the NMC, the bagged NMC (left plot, B=250) and the 
boosted NMC (right plot, B=250) using different combining rules for 30-dimensional 
Gaussian correlated data 

Fig. 6. The generalization error of the NMC, the bagged NMC (left plot, B=250) and the 
boosted NMC (right plot, B=250) using different combining rules for 8-dimensional 
banana-shaped data 

Fig. 7. The generalization error of the NMC, the bagged NMC (left plot, B=250) and the 
boosted NMC (right plot, B=250) versus the training sample size per class, using different 
combining rules for 128-dimensional pump data with 2 classes 

Fig. 8. The generalization error of the NMC, the bagged NMC (left plot, B=250) and the 
boosted NMC (right plot, B=250) using different combining rules for 128-dimensional 
pump data with 4 classes 

On the 8-dimensional banana-shaped data (see Fig. 6), bagging and boosting 
are also useful for the NMC. However, due to the non-Gaussian data distribution and a 
lower instability of the NMC, the obtained improvement is not as spectacular as in the 
previous case. Bagging outperforms boosting for critical training sample sizes. 
However, boosting performs slightly better on large training sample sizes, achieving the 
performance of the SVC. One can also see that small differences exist when 
different combining rules are used in bagging and boosting. Simple majority vote is 
again the worst combining rule for bagging. For this data set, the product combining 
rule is the best when bagging the NMC. In boosting, the weighted majority vote 
combining rule is slightly better than the other combining rules when training sample 
sizes are not large. 

When considering the 128-dimensional pump data for a 2-class and a 4-class 
problem, one can see that the NMC is more stable (Fig. 4) on this data set than on the 
other data sets, and bagging is almost useless (Fig. 7 and Fig. 8). Boosting becomes 
useful only when the number of training objects is larger than the data dimensionality. 
In this case, boosting performs better for the 2-class problem than for the 4-class 
problem, because a 2-class problem is easier to solve than a 4-class problem. However, 
it is difficult to draw further conclusions about the performance of boosting for large 
training sample sizes, as only a limited amount of data is available (225 objects per 
class). Therefore, it is impossible to check whether the boosted NMC performs 
similarly to the SVC when the number of training objects exceeds 200 per class. 
Nevertheless, the results also show that the choice of the combining rule might be 
important. In the 4-class problem, using the weighted majority vote in bagging and 
boosting is preferable to using other combining techniques. In the 2-class problem, 
boosting with the simple majority vote combining rule performs better than with the 
weighted majority vote combining rule, which is surprisingly the worst for this data 
set. It seems that no single combining rule is the best for all data sets and for all 
training sample sizes. 

5 Conclusions 

Summarizing simulation results presented in the previous section, we can 
conclude the following: 

Bagging and boosting may be useful in linear discriminant analysis. 

Bagging helps in unstable situations, for critical training sample sizes. 

Boosting is useful for large training sample sizes, when the objects on the 
border between data classes are representative enough to separate the data classes 
correctly and the classifier is able (by its complexity) to distinguish them well. In this 
way, boosting sometimes allows us to achieve the performance of the support vector 
classifier. The performance of boosting does not depend directly on the instability of 
the classifier. 

The choice of the combining rule might be important. However, it strongly 
depends on the data and the training sample size. 

A comparison of the performance of bagging and boosting should be made 
on a fair basis, with the same combining rule used in both methods. 

As a rule, simple majority vote is the worst possible choice of the combining 
rule. The weighted majority vote rule is often a good choice for both bagging and 
boosting. The average, mean, and product combining rules may also perform well, and 
sometimes better than the weighted majority vote combining rule. 

Abstract. In this paper, we investigate the combination of experts in a
multi-expert system in which each expert outputs a class label as well
as a corresponding confidence measure. We create a special confidence
measurement which is common to all experts and use it as a basis for the
combination. We develop three combination methods. The first method
is theoretically optimal but requires a very large representative training
set and a large amount of memory for the look-up table, so it is impractical.
The second method is suboptimal and greatly reduces the required training
data and memory space. The last method is a simplified version of
the second and needs the least training data and memory space. None of the three
methods requires mutual independence of the experts, so they should be
useful in many applications.

Keywords: Expert, classifier, combination methods, OCR, confidences, 
Bayes rule 



1 Introduction 

In the area of pattern recognition, practical applications require highly reliable
classification, which may be difficult for a single algorithm to achieve. Since
there are a number of classification algorithms in the literature, based on different
theories and methodologies, a combination of these can be used to improve
the classification performance by taking advantage of their strengths and avoiding
their weaknesses. The task is quite challenging because the decisions of the
individual experts may conflict.

The idea of combining the decisions of multiple experts has been explored by
many researchers [...]. In general, based on the output information, there are
three types of experts: Type I, which outputs a unique class label indicating the
most probable class to which the input pattern belongs; Type II, which outputs a
ranked list of some or all of the class labels such that the higher a class label is in the
list, the more probable it is that the input pattern belongs to the corresponding
class; and Type III, which assigns to each class label a measurement value that indicates
the degree to which the corresponding class pertains to the input pattern.
Combining Type III experts is the most challenging, because of the many
possible combinations of measurement values and the complicated relations
between these values and the experts' performance. The problem is
further complicated by the lack of standard measurements; in this sense, the
measurement values of different experts are usually not compatible.

Previous studies have developed many different approaches for expert combination.
For experts of Type I, for which only labels are available, voting algorithms
are used [...]. Label rankings are used with experts of Type
II [...]. In the case of Type III experts with measurement values interpreted
as a posteriori probabilities, a Bayesian technique is often applied to expert
assemblies [...]. If the expert output is interpreted as fuzzy membership
or evidence values, fuzzy rules [...] and the Dempster-Shafer approach [...]
are used. There are also cases of expert combination where the output of
the expert is used as a new feature and a new expert (neural network) is built
to perform the combination [...].

In this paper, we focus on a simplified version of Type III multi-expert combination.
Each expert uses its own representation, i.e. measurements extracted
from the input pattern are unique to each expert. We create a new accuracy
measurement scale, uniform for all experts, and use it as a basis for the expert
combination. We develop an optimal combination scheme, which requires an extremely
large amount of training data as well as memory space. We therefore introduce an
empirical scheme that approximates the optimal scheme so that the requirements
on the amount of training data and memory are practical. We first develop the
accuracy measure which is common to all experts. Next, we characterize each
expert with a family of accuracy maps. We then build accuracy combination
maps with a special synthetic function. Finally, we construct synthetic accuracy
maps for the combined confidences. We also propose a simplified combination
scheme which requires less training data while sacrificing some accuracy. Lastly, we
discuss the optimal rejection threshold for the final recognition decision.

In Section 2, we state the problem formulation. In Section 3, we introduce
the accuracy approximation method. We describe the optimal combination rule
in Section 4 and empirical combination rules in Section 5. We derive the
optimal rejection threshold in Section 6. We give experimental results in Section
7 and finally draw conclusions in Section 8.

2 Problem Formulation 

Many classifiers are able to supply confidence information from the measurement 
level. Bayes’ classifier supplies the a posteriori probabilities as confidence mea- 
surements. Various distance related classifiers use the distance between a test 
pattern and template/prototype patterns as confidence measurement. 

In this paper, we assume that the given experts only supply the top choice
and the corresponding confidence. Specifically, let the experts (classifiers) be
indexed by $n = 1, 2, \ldots, N$, where $N$ is the total number of experts, and let
$\Lambda$ be a mutually exclusive and exhaustive set of class labels. The pair
$(e^{(n)}(x), E^{(n)}(x))$ means that expert $n$ assigns the unknown
pattern $x$ to class label $e^{(n)}(x) \in \Lambda$ with confidence $E^{(n)}(x) \in [0, 1]$.

Our task is to approximate the conditional accuracy distribution function for
a multi-expert system,
$P\big(E^{(i)} \mid (e^{(1)}, E^{(1)}), (e^{(2)}, E^{(2)}), \ldots, (e^{(N)}, E^{(N)})\big)$,
$i = 1, 2, \ldots, N$, by using the given training samples.

3 Accuracy Approximation 

Recognizers usually differ in their confidence measures. A judicious combination
of these measures can be made only when they are on the same scale. We will
develop a way to transform confidence measures from different recognizers to
a “special” confidence, which we call accuracy. Given a “large enough” and
“properly representative” set of training samples, the accuracy measure can
be obtained by transforming the confidence.
Usually, the accuracy is an increasing function of confidence.

Let us assume that the training set is adequately representative and
discuss the accuracy approximation using the training samples. We
assume that the confidence measures of all classifiers are continuous, that is, the
confidence values can be any point in [0, 1]. There are many methods that can
implement this transform. Here we introduce a simple and efficient method.

Let $L$ patterns be classified to a certain class label by an expert and have
confidence in $[a, b)$. Let $t$ out of the $L$ patterns be correctly classified; then we
can assign the approximate accuracy over $[a, b)$ as $\hat{\mu}(r) = t/L$, $\forall r \in [a, b)$. For
a given error bound $\epsilon$, we say that $L$ is statistically meaningful if the probability
that the approximate accuracy $\hat{\mu}(r)$, $r \in [a, b)$, is within the error $\epsilon$ of the true
accuracy value $\mu(r)$, $\forall r \in [a, b)$, is greater than $1 - \epsilon$, that is,

$$\Pr\Big\{\max_{r \in [a,b)} |\hat{\mu}(r) - \mu(r)| < \epsilon\Big\} > 1 - \epsilon. \qquad (1)$$

According to (1), the difference of two accuracy values between two adjacent
representative intervals $[a, b)$ and $[b, c)$ satisfies

$$\Pr\big(|\mu(r_2) - \mu(r_1)| < 2\epsilon\big) > 1 - 2\epsilon, \quad \forall r_1 \in [a, b),\ \forall r_2 \in [b, c). \qquad (2)$$

At the same time, the accuracy value is within $[0, 1]$. So the number of representative
intervals over $[0, 1]$ must be greater than $1/(2\epsilon)$. Therefore, the number
of training samples needed to approximate the accuracy is of the order of $O(L/(2\epsilon))$ for a
class label. Since there are $M$ labels, $O(ML/(2\epsilon))$ samples in total are required
to capture the accuracy characteristic of an expert with error tolerance within
$\epsilon$.
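
The interval estimate described above (the fraction of correctly classified patterns among those whose confidence falls in a given interval) can be sketched as follows. This is an illustration only, not the authors' implementation: the function name `accuracy_map`, the fixed interval edges, and the convention of placing a confidence of exactly 1 in the last interval are assumptions made here.

```python
import numpy as np

def accuracy_map(confidences, correct, edges):
    """Estimate accuracy t/L on each confidence interval [edges[j], edges[j+1]).

    confidences: confidence values in [0, 1] for patterns assigned to one label
    correct:     booleans, True where the assigned label was right
    edges:       increasing interval boundaries, e.g. [0.0, 0.5, 1.0]
    Returns (accuracy per interval, number of patterns L per interval).
    """
    conf = np.asarray(confidences, dtype=float)
    ok = np.asarray(correct, dtype=bool)
    edges = np.asarray(edges, dtype=float)
    n_bins = len(edges) - 1
    # Index of the half-open interval that contains each confidence value.
    idx = np.searchsorted(edges, conf, side="right") - 1
    idx = np.clip(idx, 0, n_bins - 1)    # confidence 1.0 goes to the last bin
    L = np.bincount(idx, minlength=n_bins)
    t = np.bincount(idx, weights=ok, minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        acc = np.where(L > 0, t / L, np.nan)   # undefined where no samples
    return acc, L
```

Intervals that receive too few patterns give unreliable estimates, which is why the number of patterns per interval is returned alongside the accuracy.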

In the real implementation, for a given $\epsilon$ we are not able to estimate $L$
according to (1). Instead, we estimate $L$ by making the difference of two accuracy
values between two adjacent representative intervals $[a, b)$ and $[b, c)$ less than $2\epsilon$,
that is,

$$0 < \mu(r_2) - \mu(r_1) < 2\epsilon, \quad \forall r_1 \in [a, b),\ \forall r_2 \in [b, c). \qquad (3)$$

4 Optimal Combination Rule

Let us assume we are given a sufficient number of representative training samples
so that we are able to generate the conditional accuracy distribution function
for a multi-expert system,
$P\big(E^{(i)} \mid (e^{(1)}, E^{(1)}), \ldots, (e^{(N)}, E^{(N)})\big)$,
$i = 1, 2, \ldots, N$. This “Behavior-Knowledge Space” scheme produces the optimal
combination performance [...].

However, such an approach needs a large number of samples. Let us find the lower
bound on the number of samples for the above accuracy approximation with error
bound $\epsilon$. There are $O(M/(2\epsilon))$ choices for each pair $(e^{(i)}, E^{(i)})$, $i = 1, 2, \ldots, N$,
and thus $[M/(2\epsilon)]^N$ choices for $\{(e^{(1)}, E^{(1)}), \ldots, (e^{(N)}, E^{(N)})\}$. For
each choice, at least $L$ samples are required. Therefore, a total of
$O(L[M/(2\epsilon)]^N)$ samples are necessary to build the above “Behavior-Knowledge
Space”. Also, memory of $O(N[M/(2\epsilon)]^N)$ is required to build the look-up
table of the joint accuracy distribution.

5 Empirical Combination Rules 

5.1 Accuracy Combination Functions

The accuracy combination function is of the form $F(a_1, a_2, \ldots, a_n)$, where
$a_i \in [0, 1]$, $1 \le i \le n$, are $n$ accuracy variables. $F(\cdot)$ is supposed to be symmetric and
to take values in $[0, 1]$. Moreover, $F(\cdot)$ must satisfy the following two special properties:

$$F(a_1, a_2, \ldots, a_n) = 1, \quad \text{if } \exists k: a_k = 1; \qquad (4)$$

$$F(a_1, a_2, \ldots, a_n) = 0, \quad \text{if } \exists k: a_k = 0. \qquad (5)$$

The justification of (4) and (5) is obvious. In fact, it is never the case that $a_i = 1$
at the same time as $a_j = 0$ (theoretically). Therefore, in this case the function
$F(\cdot)$ is left undefined.

A family of functions satisfying the required conditions is:

$$F(a_1, a_2, \ldots, a_n) = h^{-1}\Big(\frac{1}{n^{\lambda}} \sum_{i=1}^{n} h(a_i)\Big), \quad \lambda > 0, \qquad (6)$$

where $\lambda$ is a parameter and $h(r)$, $r \in [0, 1]$, is a strictly ascending function
satisfying

$$h(0) = -\infty; \qquad (7)$$

$$h(1) = \infty. \qquad (8)$$

Here, we treat $\infty$ as an existent number. Four simple examples of $h(\cdot)$ are
listed as follows:

$$h(r) = \tan(\pi(r - 1/2)), \quad r \in [0, 1]; \qquad (9)$$

$$h(r) = (1/2 - |r - 1/2|)^{-1}(r - 1/2), \quad r \in [0, 1]; \qquad (10)$$

$$h(r) = (r - r^2)^{-1}(r - 1/2), \quad r \in [0, 1]; \qquad (11)$$

$$h(r) = (r - r^2)^{-1/2}(r - 1/2), \quad r \in [0, 1]. \qquad (12)$$
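
To make the role of $h(\cdot)$ concrete, the following Python sketch combines accuracies using the tangent example (9). It rests on an assumption that should be stated clearly: the body of Eq. (6) is read here as $F = h^{-1}\big(n^{-\lambda} \sum_i h(a_i)\big)$, which is only one plausible reading of the formula. The function names are hypothetical.

```python
import math

def h(r):
    """Example (9): strictly ascending map from [0, 1] onto [-inf, inf]."""
    if r <= 0.0:
        return -math.inf
    if r >= 1.0:
        return math.inf
    return math.tan(math.pi * (r - 0.5))

def h_inv(y):
    """Inverse of h, mapping [-inf, inf] back onto [0, 1]."""
    if y == math.inf:
        return 1.0
    if y == -math.inf:
        return 0.0
    return math.atan(y) / math.pi + 0.5

def combine_accuracies(accs, lam=1.0):
    """Assumed form of Eq. (6): h_inv(sum(h(a_i)) / n**lam), with lam > 0."""
    values = [h(a) for a in accs]
    if math.inf in values and -math.inf in values:
        # An accuracy of 1 together with an accuracy of 0: F is undefined.
        raise ValueError("accuracies 0 and 1 cannot be combined")
    return h_inv(sum(values) / len(values) ** lam)
```

Under this reading, properties (4) and (5) hold because a single infinite term dominates the sum. With `lam = 1` two equal accuracies combine to the same value, while `lam < 1` lets agreeing experts reinforce one another.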

5.2 Combination Scheme

For each classification distribution of the multi-expert system $(C_1, C_2, \ldots, C_N)$,
we build an accuracy map, called the characteristic accuracy map, for each expert.
We note that there are $M^N$ possibilities, so we build $M^N$ characteristic accuracy
maps in total for each expert. In practice, we might not have all the $M^N$
possibilities, since some of the permutations do not occur. The required number of
training samples with error bound $\epsilon$ is $O(M^N L/(2\epsilon))$. We note that the required
number of training samples is just linear in the number required to approximate an accuracy
map, thus greatly reducing the required number of training samples.

For a given distribution $(C_1, C_2, \ldots, C_N)$, we denote by $\chi$ the set of training
samples such that

$$(e^{(1)}(x), e^{(2)}(x), \ldots, e^{(N)}(x)) = (C_1, C_2, \ldots, C_N), \quad \forall x \in \chi. \qquad (13)$$

Let $\Gamma$ denote an expert index set such that $C_i = C$, $\forall i \in \Gamma$, and $|\Gamma| \ge 2$. Let
$E^{(i)}$, $i \in \Gamma$, denote the characteristic accuracy maps constructed from $\chi$, with
$e^{(i)}(x) = C$, $\forall i \in \Gamma$, $\forall x \in \chi$. We define the synthetic accuracy as

$$E^{\Gamma}(x) = F\big(E^{(i)}(x) : i \in \Gamma\big). \qquad (14)$$

When the combination scheme is given, we can easily obtain the combination
accuracy from the synthetic confidence $E^{\Gamma}$ using the data set $\chi$. The
generated accuracy map is supposed to be an ascending function of the combination
confidence.

Now let us discuss the maximum memory required for the look-up table
for all accuracy maps. For simplicity, we just consider the number of accuracy
maps. The number of synthetic accuracy maps which combine exactly $n$ accuracies
is $\binom{N}{n} M (M - 1)^{N-n}$. The total number of synthetic accuracy maps is
$\sum_{n=2}^{N} \binom{N}{n} M (M - 1)^{N-n} = M\big(M^N - N(M - 1)^{N-1} - (M - 1)^N\big)$. The maximum
number of accuracy maps, including characteristic and synthetic maps, is
$M\big(M^N - N(M - 1)^{N-1} - (M - 1)^N\big) + N M^N$.



5.3 Simplified Combination Scheme 

In the original scheme, we need to build $O(M^N)$ accuracy maps for each expert.
Usually, $M$ is a large number, e.g., $M = 10$ in numeral recognition and $M =
10 + 26 \times 2 = 62$ in character recognition. So $M^N$ can be a very large number even
for $N = 2$. Thus this method still needs quite a large amount of training data. So
instead of collecting training data for each specific distribution $(C_1, C_2, \ldots, C_N)$
as in the original scheme, we can collect training data for each case such that
$C_i = C$, $\forall i \in \Gamma$, where $C \in \Lambda$ and $\Gamma$ is a set of expert indices such that $|\Gamma| \ge 2$.
This method requires a smaller set of training data; however, this is gained by
sacrificing performance: the method is inferior to the original
scheme when sufficient representative training data are given.

The synthetic accuracy map construction procedure is the same as in the original
scheme. For a particular expert $n$, the expert index set $\Gamma$ including index $n$ has $2^{N-1}$
choices and the class label $C$ has $M$ choices. Thus we need to build $2^{N-1} M$ characteristic
accuracy maps for each expert. Therefore $O(2^{N-1} M L/(2\epsilon))$ samples
are required. In return, we reduce the required number of training samples by
a factor of $O((M/2)^{N-1})$. When $M = 2$, both schemes become identical.

Now let us discuss the maximum number of accuracy maps. The number of
synthetic accuracy maps which combine exactly $n$ accuracies is $M \binom{N}{n}$. The
total number of synthetic accuracy maps is $\sum_{n=2}^{N} M \binom{N}{n} = M(2^N - 1 - N)$.
The maximum number of accuracy maps, including characteristic and synthetic
maps, is $M(2^N - 1 - N) + N 2^{N-1} M$. In comparison with the original combination
scheme, we also reduce the necessary memory by a factor of $O((M/2)^{N-1})$.
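
As a quick arithmetic check of the map counts in Sections 5.2 and 5.3, the sketch below sums the per-$n$ counts directly and compares them with the closed forms. The per-$n$ terms, $\binom{N}{n} M (M-1)^{N-n}$ for the original scheme and $M\binom{N}{n}$ for the simplified one, are taken as we read them from the text and should be treated as assumptions; the function names are hypothetical.

```python
from math import comb

def original_synthetic_maps(M, N):
    """Sum over n = 2..N of C(N, n) * M * (M - 1)**(N - n)."""
    return sum(comb(N, n) * M * (M - 1) ** (N - n) for n in range(2, N + 1))

def original_closed_form(M, N):
    """M * (M**N - N*(M - 1)**(N - 1) - (M - 1)**N)."""
    return M * (M ** N - N * (M - 1) ** (N - 1) - (M - 1) ** N)

def simplified_synthetic_maps(M, N):
    """Sum over n = 2..N of M * C(N, n)."""
    return sum(M * comb(N, n) for n in range(2, N + 1))

def simplified_closed_form(M, N):
    """M * (2**N - 1 - N)."""
    return M * (2 ** N - 1 - N)
```

For example, with $M = 10$ labels and $N = 3$ experts the original scheme needs 280 synthetic maps while the simplified scheme needs 40.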

6 Optimal Rejection Threshold 

In the final stage of making the recognition decision, we have to make one of two
decisions: acceptance or rejection. There is a cost associated with both error
and rejection, so a trade-off between the rejection and error ratios must be made.
We follow the optimization objective as in [...]:

$$F_{obj} = R_{err} + \alpha R_{rej}, \qquad (15)$$

where $R_{err}$ and $R_{rej}$ are the error ratio and the rejection ratio, respectively, and $0 <
\alpha < 1$ is a deterministic parameter.

A natural way to determine the recognition class is to choose the class which
has the maximum accuracy under the proposed scheme. However, we need to
decide whether to discard or accept the recognition class according to the recognition
accuracy.

Let us determine the accuracy threshold which minimizes the objective defined
in (15). In fact, we can explicitly express both $R_{err}$ and $R_{rej}$ as
functions of a threshold $\theta$ as follows:

$$R_{err} = \int_{r=\theta}^{1} (1 - r)\,dr = 1/2 - \theta + \theta^2/2, \qquad (16)$$

$$R_{rej} = \int_{0}^{\theta} 1\,dr = \theta. \qquad (17)$$

Hence, we have

$$F_{obj} = \min_{\theta \in [0,1]} \{1/2 - \theta + \theta^2/2 + \alpha\theta\}. \qquad (18)$$

Taking the derivative with respect to $\theta$ and setting it to zero, we obtain

$$-1 + \theta + \alpha = 0. \qquad (19)$$
Thus the optimal accuracy threshold $\theta$ is given by:

$$\theta = 1 - \alpha. \qquad (20)$$

That is, when the recognition accuracy is less than the decision threshold $\theta = 1 - \alpha$
we reject the result; otherwise we accept it.

When $\alpha = 1$, rejecting an unknown pattern costs as much as misclassifying it,
so we would like to accept all recognition results. This agrees with the optimal threshold
$\theta = 0$. When $\alpha = 0$, the objective is to minimize the error rate alone, so we
would accept an unknown pattern only if it has recognition accuracy 1, that
is, no error is made (theoretically). This again agrees with the optimal threshold of
$\theta = 1$.
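
The closed-form threshold can be checked numerically. The sketch below evaluates the objective $R_{err} + \alpha R_{rej}$, with $R_{err} = 1/2 - \theta + \theta^2/2$ and $R_{rej} = \theta$, on a grid of thresholds and returns the minimizer; the function names and the grid resolution are assumptions made for this illustration.

```python
def objective(theta, alpha):
    """F_obj = R_err + alpha * R_rej for a rejection threshold theta."""
    r_err = 0.5 - theta + theta ** 2 / 2   # error ratio as a function of theta
    r_rej = theta                          # rejection ratio
    return r_err + alpha * r_rej

def best_threshold(alpha, steps=1000):
    """Grid search over theta in [0, 1]; should agree with theta = 1 - alpha."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda theta: objective(theta, alpha))
```

For the values of $\alpha$ used in the experiments (1/5, 1/10, 1/15, 1/20) the grid minimizer lies at or next to $1 - \alpha$, up to the grid resolution.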

7 Experimental Results 

The training set used in the construction of the accuracy approximation and the
testing set were created using digit samples extracted from the US mail stream.
There are two reasons why we use our own database. First, the recognizers used
in the combination schemes achieve an almost 100% correct rate on publicly
available databases, such as NIST. Second, all classes are equally represented in
the training set, which is not the case with other databases. The training set
contains 120,000 digit samples, and the testing set contains 30,000 digit samples.



Table 1. Performances of binpoly and gradient experts

        binpoly                     gradient
α       err(%)  rej(%)  opt         err(%)  rej(%)  opt
1/5     4.49    31.61   54.04       1.92    12.15   21.73
1/10    0.19    82.64   84.52       1.41    17.30   31.35
1/15    0.19    82.64   85.46       1.03    22.27   37.77
1/20    0.19    82.64   86.46       0.82    26.37   42.86


We used the simplified method exactly as it is described in Section 5.3 and
the thresholds given in the previous section. Table 1 shows the performance of
the “binpoly” expert [...] and the performance of the “gradient” expert [...].
The “binpoly” expert is a polynomial discriminant algorithm trained to extract
a relative weighting for each feature in each class. The “gradient” expert encodes
local contour variations of the character image into a binary feature vector.



Table 2. Performance of binpoly-gradient combination

α       err(%)  rej(%)  opt
1/5     1.44    9.34    16.54
1/10    0.97    12.88   22.63
1/15    0.76    15.42   26.88
1/20    0.69    17.20   30.92

Table 3. Performance of kp and gsc experts

        kp                          gsc
α       err(%)  rej(%)  opt         err(%)  rej(%)  opt
1/5     0.89    10.96   15.44       1.16    4.99    10.80
1/10    0.78    12.00   19.79       1.00    6.19    16.14
1/15    0.67    13.73   23.84       0.90    7.57    21.14
1/20    0.47    16.25   25.61       0.85    8.58    25.58


Table 2 shows the results of the combination of these two experts. As we can
see, for all values of α, the combination method significantly improves
the objective function: by 23.93% for α = 0.2, 27.75% for α = 0.1, 28.86% for
α = 0.067, and 27.85% for α = 0.05, compared to the values of the objective
function for the better expert.

Table 3 describes the performance of the “kp” expert (unpublished) and the
performance of the “gsc” expert [...]. Table 4 shows the results of the combination
of these two experts. The “kp” expert combines the merits of the “binpoly”
and “gradient” experts. The “gsc” expert extracts features based on gradient,
structural, and concavity information. As we can see, these are much more accurate experts;
nevertheless, for all values of α, the combination method improves the objective
function: by 5.56% for α = 0.2, 6.82% for α = 0.1, 13.71% for α = 0.067, and
17.20% for α = 0.05, compared to the values of the objective function for the
better expert.



Table 4. Performance of kp-gsc expert combination

α       err(%)  rej(%)  opt
1/5     1.10    4.73    10.22
1/10    0.82    6.85    15.05
1/15    0.66    8.41    18.25
1/20    0.60    9.23    21.17



8 Conclusion 

We have investigated a simplified version of Type III multi-expert systems in
which each expert outputs a class label and a corresponding confidence. We
have developed a general theoretical framework for an optimal posterior-probability-based
combination scheme and have shown that it needs a huge representative
training set as well as a large amount of memory, which is impractical. We have therefore
developed an empirical approach to approximate the joint accuracy distribution
function. In this approach, we develop a special measurement, accuracy, which
is applicable to all experts. We characterize each expert with a class of accuracy
maps. We also develop a family of special combination functions. Finally, we
have discussed the optimal accuracy threshold for the recognition decision.

This approach does not require mutual independence of the experts. In fact, [...]
is just a special case of our approach. However, all the desirable properties hold only from
the statistical point of view: a “large enough” and “well represented” training
sample set must be available. If only a few samples are collected randomly and
carelessly, the desired properties of this method cannot be guaranteed.

Abstract. In this paper we apply a k-nearest-neighbour-based data condensing
algorithm to the training sets of multi-layer perceptron neural networks. By removing
the overlapping data and retaining only training exemplars adjacent to
the decision boundary we are able to significantly reduce the network training
time while achieving an undegraded misclassification rate compared to a network
trained on the unedited training set. We report results on a range of synthetic
and real datasets which indicate that a speed-up of an order of magnitude
in the network training time is typical.

Keywords: Neural networks, data editing, pattern classifiers 



1 Introduction 

Neural networks have been shown to be a valuable non-parametric pattern classifi- 
cation technique which - subject to certain conditions - can approximate the posterior 
probability of class membership [1]. Due to their computational compactness during 
recall, multilayer perceptrons (MLPs) have attracted much attention and it is this neural
architecture we consider here. The principal drawback of MLPs is that their training
is approximately o(n^) and thus scales unattractively. This has led to a great deal
of research aimed at reducing the size of the training set of examples to a set which 
nonetheless captures the critical information about the classification mapping in hand. 

Kraaijveld & Duin [2] applied the well-known multiedit algorithm to the training 
set of a neural network which, since it removes class overlap in the dataset, resulted in 
faster training and an unambiguous stopping criterion for the training. Along with 
many others, Kraaijveld & Duin noted that only the subset of exemplars adjacent to 
the decision boundary is required to train a classifier but their application of a con- 
densing algorithm produced poor results because the very small amount of data re- 
maining after condensing was clearly insufficient to adequately constrain the decision 
surface of the subsequently trained MLP. 

A number of other authors have attempted to pre-select training exemplars for neu- 
ral networks with the objective of reducing the training time while not degrading clas- 
sification performance compared to a network trained conventionally on the whole set 

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 650-657, 2000. 
of prototype patterns. Hara & Nakayama [3] employed a data pairing method where
they selected the pairs of exemplars drawn from different classes but with the mini- 
mum Euclidean separation although the sparseness of the edited set produced means it 
will probably be quite sensitive to sampling effects. Cachin [4] compared a number of 
techniques which use different strategies to make more frequent presentations to the 
partially trained network of those patterns for which the training error is highest. 
Although the number of presentations of the training set required to attain a given 
error rate is reduced by maybe half compared to conventional cycling through the 
dataset, a single presentation in the technique studied by Cachin comprises multiple 
repeats of individual patterns within the training set and it is unclear whether there is 
an overall saving in CPU time. Further, it is also unclear whether increasing the rela- 
tive frequency of presentation of some patterns in the training set implicitly changes 
the priors on the problem, thus violating one of the conditions for the network to learn 
the posterior probability [1]. Leisch et al [5] have used an active pattern selection 
strategy to significantly reduce the computational burden of cross-validation of MLPs 
while Hwang et al [6] have employed query-based selection of training points to fine- 
tune an almost-trained network. 

In the present work we have employed a condensing algorithm which uses nearest- 
neighbour estimates of the posterior probability to significantly reduce the size of the 
training set and therefore the training time. In doing this we have taken pains not to 
over-edit the training set such that the reduced dataset is unable to constrain the deci- 
sion boundary sufficiently or lacks resilience to sample size effects; we thus address 
the principal shortcoming of the condensing procedure presented in [2]. In the next 
section we describe our method and show results on a range of synthetic and real data 
in Section 3. The results are discussed in Section 4 and conclusions offered in Section 5. 



2 The Data Condensing Algorithm 

By definition, any datum for which the posterior probability is 0.5 lies on the decision
surface whereas data with posterior probabilities of either 0 or 1 are remote from
the decision surface and therefore of minor importance to proper training. Considering
(without loss of generality) a two-class problem, our aim is to include in the condensed
training set only those data for which $0.5 < P(C|X) < 0.5 + \lambda$, where $P(\cdot)$ is the
posterior probability, $C$ is the class, $X$ is the input vector and $\lambda$ controls the degree of
data reduction. We thus eliminate data which overlap the other class(es), resulting in
improved training [2]. In essence we select two bands or ‘strips’ of non-overlapping
data either side of the decision surface to guard against excessive sensitivity to sampling.
In this way we can constrain the location of the decision boundary without permitting
the network to perform spurious extrapolations in the pattern space.

To estimate the posterior probabilities we have used the k-nearest-neighbour (k-NN)
method. If a datum satisfies the bounds on its probability given above, it is copied into the
condensed dataset. The condensing algorithm employed here is thus particularly simple
and straightforward to implement.
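
A minimal Python sketch of such a condensing step is given below. It is an illustration under stated assumptions rather than the code used for the experiments: the function name `condense`, the brute-force Euclidean neighbour search, and the exclusion of each datum from its own neighbourhood are choices made here. The posterior of a datum's own class is estimated as the fraction of its k nearest neighbours that share its label, and the datum is kept when that estimate lies strictly between 0.5 and 0.5 + λ.

```python
import numpy as np

def condense(X, y, k, lam):
    """Keep data whose k-NN posterior estimate p satisfies 0.5 < p < 0.5 + lam.

    X: (n_samples, n_features) array, or a 1-D array for scalar inputs
    y: class labels of length n_samples
    Returns (condensed X, condensed y, posterior estimates for all data).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if X.ndim == 1:
        X = X[:, None]
    # Brute-force pairwise Euclidean distances; adequate for small datasets.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)       # assumption: a datum is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Fraction of the k neighbours that share the datum's own label.
    p_own = (y[neighbours] == y[:, None]).mean(axis=1)
    keep = (p_own > 0.5) & (p_own < 0.5 + lam)
    return X[keep], y[keep], p_own
```

Data lying inside the other class (estimate at or below 0.5) and data far from the boundary (estimate at or above 0.5 + λ) are both discarded, which leaves the two ‘strips’ described above.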

3 Results

In this work we have considered only MLPs with sigmoidal transfer functions and 
used conventional error back propagation minimising a squared error metric to train 
the MLPs. We have applied the present data condensing method to a range of syn- 
thetic and real problems. 

3.1 One-Dimensional Overlapping Gaussians

The 1D Gaussian represents probably the simplest class of problem but is nonetheless
instructive. We generated 500 data randomly drawn from each of two overlapping Gaussian
distributions, N(1,1) and N(3,1). Assuming equal priors, the decision boundary
is where the variate equals 2.0, and 158 of the data overlap the other class, implying a
misclassification rate of about 16%. We have trained an MLP with one input node, two
hidden nodes and one output node on this problem, and the error rates and the computational
effort are shown in Table 1. (For convenience we define the ‘computational
effort’ of training to be the product of the number of iterations to convergence
and the size of the training set, normalised to the corresponding product for the unedited
dataset.) The first two rows of Table 1 show the variation of the size of the training
set with the parameter λ.
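
The figure of about 16% can be cross-checked against the theoretical error for this problem. The short sketch below assumes that N(1,1) and N(3,1) denote Gaussians with means 1 and 3 and unit variance, and that the priors are equal, so that the boundary lies midway between the means; the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayes_error_two_gaussians(mu1, mu2, sigma):
    """Error of the midpoint decision boundary for two equal-prior,
    equal-variance 1D Gaussians."""
    half_gap = abs(mu2 - mu1) / (2.0 * sigma)
    return normal_cdf(-half_gap)
```

For means 1 and 3 with unit standard deviation this gives roughly 0.159, in line with the 158 overlapping data out of 1000 reported above.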

The error rate for each of the trained MLPs was estimated using a validation set of 
4000 data independent of the training set. Both the misclassification rate and the com- 
putational effort have been averaged over fifty independently initialized networks. We 
have (approximately) optimized the learning rate and momentum parameters used in 
the backpropagation algorithm; we believe this to be a reasonable comparison since 
although we have used fixed learning parameters, in practice these would typically be 
determined using an optimising line search technique. 

In considering the results in Table 1, the reference figures for networks trained on
the original, unedited dataset of 1000 members are an error rate of 16.07% and a (normalized)
computational effort of unity. (Throughout this paper we gauge the validation
error from the minimum figure attained with reference to the independent validation
set.) Two things are immediately apparent from Table 1. First, the misclassification
rate is not statistically different between the network trained on the original
dataset and the networks trained on the condensed datasets. This is gratifying since
it meets one of our original objectives. Second, the training effort is reduced by up
to an order of magnitude by data condensing, but, all other things being equal, one
would expect the dataset for λ = 0.05 (41 members) to result in faster training than,
say, the dataset for λ = 0.45 (612 members). Given that we are defining computational
effort to be the product of the number of iterations to convergence and the cardinality
of the training set, clearly the dataset for λ = 0.05 is much slower to converge.

Hara & Nakayama [3] have considered this second phenomenon in relation to their
data pairing technique and suggest that initially, the (randomly assigned) decision
boundary is far from its correct location. If all the training data are grouped in a narrow
region either side of the true decision boundary, the derivatives which drive the
backpropagation algorithm will be exceedingly small since we are evaluating these at
the data points. (Equivalently, all of the neurons will be operating in the saturation
regions of the sigmoid transfer characteristic, which are almost horizontal.) Thus the
rate at which the network adjusts to the true decision boundary will be slow. Datasets
constructed from larger values of λ have wider ‘bands’ of data around the decision
surface, thus ameliorating the problem of small derivatives. The computational effort
results in Table 1 show a minimum for λ = 0.25, which is consistent with this explanation.

To speed convergence still further we have taken the condensed dataset of 41 data
for λ = 0.05 and added to it 40 randomly selected data - twenty from each class - with
posterior probabilities in the range 0.55 < P(C|X) < 1.0. Thus if the initial random
decision boundary is far from the true boundary, the randomly selected data scattered
throughout the space should ensure that the derivatives of the error function do not
vanish. The comparable averaged figures for the λ = 0.05 + randomly selected data are
a misclassification rate of 15.92% and a computational effort of 0.02, a fifty-fold reduction
compared to training with the unedited dataset. This represents something of a
paradox since adding more data results in faster training.



3.2 Two-Dimensional Synthetic Problem 

We have applied the present technique to a 2D two-class classification problem 
shown in Fig 2(a). We have used 2000 points from each class as a training set (which 
included 289 overlapping data) and we estimated the posterior probabilities over 100 
nearest-neighbours (k = 100). The original dataset is shown in Figure 1(a) and the 
condensed dataset for λ = 0.05 is shown in Figure 1(b). Various condensed datasets 
were used to train a 2-8-1 MLP and again results are presented for ‘optimal’ learning 
parameters in Table 2; the validation results were obtained over an independently 
generated dataset of 14,486 members. 

The first column of Table 2 shows the results for training with the original unedited 
dataset; as before, the error and computational effort results have been averaged over 
fifty independent trainings. The trends in the results are identical to the 1D problem 
presented previously and similarly, we have also trained with a dataset formed by the 
union of the condensed set for λ = 0.2 + 214 randomly selected points for which the 
posterior probability was > 0.7. In this case, the error rate was 6.35% and the 
computational effort was 0.07, in line with the previous findings. 



3.3 Real Datasets: Breast Cancer & Character Recognition 

We have employed the current method on two real datasets: a breast cancer dataset 
and a handwritten character recognition dataset both from the UCI Database [7]. The 
breast cancer data comprised 683 members (we have excluded 16 data with missing 
attributes) and this was randomly divided into a training set of 483 and a test set of 
200. The training set comprised 167 benign examples and 316 malignant; the test set 
comprised 73 benign and 127 malignant. The data was condensed using a k-value of 
30 and we have trained 9-9-1 MLPs. Table 3 shows the results. 

From the handwritten character dataset [7] we have selected the “B” and “D” data 
to form a two-class problem. We have selected a training set of 1268 members (630 
“B” and 638 “D”) and a test set of 303 (136 “B” and 167 “D”). Using a k-value of 60 
we have trained a 16-10-1 MLP and the results are shown in Table 4. 

In summary, both the real and synthetic datasets display consistent results: the use 
of condensed datasets speeds training significantly whilst not degrading the classifica- 
tion performance. 



4 Discussion 
Figure 1: 2D Synthetic dataset, (a) shows the original dataset 
and (b) shows the condensed dataset for λ = 0.05 



One of the pre-conditions under which a neural network learns to approximate the 
posterior probability on a classification problem is that the training sample reasonably 
reflects the data distribution on the underlying or parent problem. At a point in the 
region of overlapped data, the MSE training criterion ensures the network output is the 
best compromise between competing forces by approximating the posterior probability. 
If the overlapping data are removed from a training set then clearly the network can 
no longer hope to approximate the posterior probability and such a trained network’s 
capabilities as a classifier need to be carefully re-evaluated. Formally, Bayesian decision 
theory involves two stages: firstly, the evaluation of the posterior probability for each 
class; secondly, picking the class with the largest posterior probability*. Traditionally, 
neural networks have been used to estimate the posterior 
probability and the second stage has been performed implicitly. If we remove class 
overlap we would expect the network to produce a step change in output on passing 
through the decision surface. In this context, it is clear that the trained network has 
learned the Bayesian decision mapping in its entirety rather than just the 
input-space-to-posterior-probability mapping which it would do if trained on the unedited dataset. 
We thus conclude that networks trained on condensed datasets still implement 
maximum a posteriori (MAP) classifiers but dispense with the (implicit) decision stage. To 
illustrate this, network outputs for the 1D Gaussian problem (Section 3.1) are shown in 
Fig 2. The network trained on the unedited dataset - Figure 2 - approximates the 
posterior probability whereas the network trained with the condensed dataset approximates 
a binary decision. 

By employing suitably edited training sets we have demonstrated here that savings 
in the MLP training time of factors of 4-50 are possible compared to training with the 
unedited training set although the exact speedup will depend on the mapping, MLP 
architecture, etc. In practice we need to compare the overall training times (time to 
perform the data editing plus the reduced time to train the network). While the data 

condensing algorithm presented here is where N is the size of the dataset, the 

backpropagation algorithm is approximately Thus, as N becomes larger, we 
would expect the condensing-preceding-training approach to yield favourable reductions 
in overall computation time. In practice, however, this asymptotic argument 
understates the potential time savings: since there is no principled method to determine 
in advance the optimum number of hidden units for an MLP for a given problem, 
it is normal practice to train a range of networks to find the one with the fewest hidden 
neurons that is able to satisfactorily learn the problem (i.e. fastest recall). The 
condensing algorithm clearly needs to be run only once for a given problem so in practice 
the (more favourable) time comparison should be the time to perform M network- 
trainings on the unedited dataset versus the time for one run of the condensing algo- 
rithm plus M network trainings on the condensed dataset. Taken together with the 
effective elimination of overfitting, we believe the present approach represents a sig- 
nificant advance in the training of neural network classifiers. To give a concrete tim- 
ing result, we have trained a 16-16-1 MLP both on an original training set of 16000 
items constructed from the character recognition data [7] as well as the condensed set 
of 1500 data. The total training times are as follows: Original dataset = 16.296 min- 
utes; Condensed dataset = 6.137 minutes (comprising 3.943 minutes for the data con- 
densing algorithm and 2.194 minutes for the MLP training) indicating that the present 
method achieves significant time savings even on moderately sized datasets. 
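
The comparison of M trainings can be made concrete with the timings just quoted. The sketch below assumes, purely for illustration, that each of the M networks takes the same time to train as the 16-16-1 MLP reported above; in practice the time varies with the number of hidden units.

```python
# Timings in minutes, taken from the text above.
CONDENSE_MIN = 3.943         # one run of the condensing algorithm
TRAIN_CONDENSED_MIN = 2.194  # one MLP training on the condensed set
TRAIN_ORIGINAL_MIN = 16.296  # one MLP training on the original set

def total_original(m):
    # m trainings on the unedited dataset.
    return m * TRAIN_ORIGINAL_MIN

def total_condensed(m):
    # One condensing run followed by m trainings on the condensed dataset.
    return CONDENSE_MIN + m * TRAIN_CONDENSED_MIN

for m in (1, 5, 10):
    speedup = total_original(m) / total_condensed(m)
    print(m, round(total_original(m), 3), round(total_condensed(m), 3), round(speedup, 2))
```

Because the condensing cost is paid once, the ratio grows with M and approaches the ratio of the two per-training times.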



* Pedantically, the decision stage assigns the class with the minimum cost or risk. Here, without 
loss of generality we are assuming both types of error have equal costs. 
Figure 2: Comparison of network outputs for MLPs 
trained with the original 1D Gaussian dataset and 
condensed dataset. The network outputs for class ‘B’ - which 
is the complement of that for class ‘A’ - have been 
omitted for clarity 



5 Conclusions 

In this paper we have reported the results of training multi-layer perceptrons (MLPs) 
with datasets condensed by a technique which uses k-nearest neighbour estimates of 
posterior probability. In essence, we select two non-overlapping ‘strips’ of data either 
side of the decision boundary such that 0.5 < p(c|x) < 0.5 + λ where p(c|x) is the 
conditional probability of class membership and λ is a parameter which controls the 
degree of data reduction. We have demonstrated that the MLPs so trained have 
misclassification rates not statistically different from networks trained on the original 
unedited dataset but the computational effort needed is significantly reduced. It is 
difficult to generalise about the savings in MLP training time since this will depend on 
the problem in hand but for the representative datasets considered here, an order of 
magnitude is typical. 
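
The selection rule summarised above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are a hypothetical 1D two-class Gaussian sample, the posterior of a point's own class is estimated as the fraction of its k nearest neighbours (excluding itself) that share its label, and the function names are invented here rather than taken from the paper.

```python
import random

def knn_posterior(data, index, k):
    """Fraction of the k nearest neighbours of data[index] that share its class."""
    x, label = data[index]
    others = [(abs(x - xo), lab) for j, (xo, lab) in enumerate(data) if j != index]
    others.sort(key=lambda pair: pair[0])
    same = sum(1 for _, lab in others[:k] if lab == label)
    return same / k

def condense(data, k, lam):
    """Keep points whose estimated own-class posterior lies in (0.5, 0.5 + lam)."""
    kept = []
    for i, point in enumerate(data):
        p = knn_posterior(data, i, k)
        if 0.5 < p < 0.5 + lam:
            kept.append(point)
    return kept

random.seed(0)
# Hypothetical overlapping classes: unit-variance Gaussians centred at -1 and +1.
data = [(random.gauss(-1.0, 1.0), "A") for _ in range(300)]
data += [(random.gauss(+1.0, 1.0), "B") for _ in range(300)]

condensed = condense(data, k=30, lam=0.2)
print(len(data), len(condensed))
```

Points on the wrong side of the boundary have an estimated posterior below 0.5 and are dropped, as are points deep inside each class; only the two strips next to the boundary survive.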

We have been able to further speed the initial stages of training by adding a small 
percentage of randomly selected data to the training set which ensures that the gradient 
values driving the backpropagation algorithm do not become unacceptably small when 
the initial, randomly-assigned decision boundary is far from the true decision surface. 
Table 1: Misclassification & computational effort results for various training sets for the 1D 
Gaussian problem 

λ             0.05    0.1     0.15    0.2     0.25    0.3     0.35    0.45
No of Data    41      85      107     159     209     268     354     612
Error [%]     16.07   -       15.92   -       15.92   -       15.95   16.05
Computation   0.70    -       0.13    -       0.09    -       0.12    0.1



Table 2: Misclassification & computational effort results for various training sets for the 2D 
synthetic problem 

λ             N/A     0.2     0.3
No of Data    4000    278     492
Error [%]     6.37    6.4     6.34
Computation   1       0.19    0.09



Table 3: Misclassification and computational effort for various training sets for the breast 
cancer data. The first column shows the results for the unedited dataset and the last column the 
results for λ = 0.3 + 26 random data 

λ             N/A     0.2     0.3     0.3 + random data
No. of data   483     78      104     104
Error [%]     3.5     3.0     3.5     3.5
Computation   1       0.09    0.12    0.04



Table 4: Misclassification and computational effort for various training sets for the 
handwritten character data. The first column shows the results for the unedited dataset and the last 
column the results for λ = 0.2 + 129 random data 

λ             N/A     0.2     0.3     0.2 + random data
No of data    1268    323     452     452
Error [%]     3.3     2.97    3.3     3.3
Computation   1       0.39    0.41    0.19



Non-linear Invertible Representation for Joint 
Statistical and Perceptual Feature Decorrelation 



J. Malo^, R. Navarro^, I. Epifanio^, F. Ferri^, and J.M. Artigas^ 



^ Dpt. d’Optica, Universitat de Valencia 
^ Dpt. d’Informatica, Universitat de Valencia 
Av. Vicent Andres Estelles S/N, 46100 Burjassot, Valencia, Spain 
^ Instituto de Optica (CSIC) 

C/ Serrano 122, 28006 Madrid, Spain 



Abstract. The aim of many image mappings is representing the sig- 
nal in a basis of decorrelated features. Two fundamental aspects must 
be taken into account in the basis selection problem: data distribution 
and the qualitative meaning of the underlying space. The classical PCA 
techniques reduce the statistical correlation using the data distribution. 
However, in applications where human vision has to be taken into ac- 
count, there are perceptual factors that make the feature space uneven, 
and additional interaction among the dimensions may arise. 

In this work a common framework is presented to analyse the perceptual 
and statistical interactions among the coefficients of any representation. 
Using a recent non-linear perception model a set of input-dependent fea- 
tures is obtained which simultaneously remove the statistical and per- 
ceptual correlations between coefficients. A fast method to invert this 
representation is also presented, so no input-dependent transform has to 
be stored. The decorrelating power of the proposed representation sug- 
gests that it is a promising alternative to the linear transforms used in 
image coding, fusion or retrieval applications. 



1 Introduction 

Independence among the features is recognized as an intrinsic advantage of a 
given signal representation because it allows simple scalar data processing and 
a better qualitative interpretation of the feature vector. This is why the 
aim of most feature extraction transforms is to find out a complete set (a basis) 
of independent features. Two main factors should determine the basis selection 
problem: the data distribution and the qualitative (geometric) properties of the 
underlying space. The basis functions should not only reflect the principal axis 
of the training set but also the eventual anisotropies of the space. 

This is particularly important in applications involving natural imagery or 
texture description, such as indexing and retrieval, fusion, or transform coding. 

^ The authors wish to thank A.B. Watson and E.P. Simoncelli for their fruitful com- 
ments. This work has been partially supported by the CICYT-FEDER (TIC) pro- 
jects 1FD97-0279 and 1FD97-1910 and CICYT (TIC) 98-677-C02 



F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 658-^^3 2000. 
In these cases, in addition to the data distribution, it is usually necessary to 
take into account the properties of the Human Visual System (HVS): not every 
scale, texture or colour component has the same relevance for the HVS, and 
undesired perceptual interactions among the coefficients may arise if they are 
scalarly processed. Therefore, in many applications the concept of independence 
of image coefficients has not only a statistical meaning, but it may also be related 
to the intrinsic (perceptual) geometry of the space. On the other hand, the HVS 
has developed efficient representations to deal with natural imagery [,Sp4|5l6j . 
so the knowledge of the geometry of the low-level representation of a general- 
purpose biological vision system is of theoretical interest for image processing. 

Wavelet and local DCT transforms are widely used in many applications due 
to both statistical and perceptual reasons. On one hand, they are used as an ap- 
proximate fixed-basis Principal Component Analysis (PCA). On the other hand, 
these transforms are similar to the first linear stage in HVS processing. Howe- 
ver, the statistical and perceptual decorrelation obtained with these transforms 
is not complete. 

Recently developed perception models with non-linear interactions between 
the coefficients of wavelet-like representations can show interesting 
statistical decorrelation properties, but they cannot be used in image processing 
applications because they are not analytically invertible. 

In this work the basis selection problem is analysed from both the statistical 
and the perceptual points of view. Here the covariance and the perceptual metric 
matrices are used together to evaluate the statistical and perceptual interactions 
between the features under a common framework. Also, a fast method to invert 
the most recent perceptual representation [7,8,9,6] is developed and tested. It 
is shown that excellent decorrelation results are obtained from both statistical 
and perceptual points of view just taking into account the perceptual geometry 
of the wavelet-like feature space. In this context, the decorrelating power of 
this representation is compared with fixed linear transforms and (impractical) 
PCA-like methods that require the storage of ad-hoc basis functions. 



2 Aim of the Feature Extraction Transform 

Matrices of Second Order Relations. The statistical deviations from an 
image $a_0$ in a certain feature space are described by the covariance matrix, $\Gamma$: 

$$\Gamma(a_0) = E\left[(a - a_0)\cdot(a - a_0)^T\right] = E\left[\Delta a \cdot \Delta a^T\right] \qquad (1)$$

Assuming an $L_2$ norm, the perceptual deviation from $a_0$ due to a distortion 
$\Delta a$ is determined by the perceptual metric, $W$, of the domain at that point, 

$$d(a_0, a_0 + \Delta a)^2 = \Delta a^T \cdot W(a_0) \cdot \Delta a = \sum_i W_{ii}\,\Delta a_i^2 + 2 \sum_{i<j} W_{ij}\,\Delta a_i\,\Delta a_j \qquad (2)$$

Associated non-aligned ellipsoids. The covariance and the perceptual metric 
matrices are quadratic forms that describe two different interesting ellipsoids. 
Fig. 1. Ellipsoids describing the data distribution and the space geometry around $a_0$. 



On one hand, $\Gamma$ describes the shape of the distribution of image samples 
around $a_0$. Non-zero off-diagonal elements in $\Gamma$ indicate a deviation between 
the data ellipsoid and the axes of the space. This deviation implies a statistical 
correlation between features in the training set. On the other hand, $W$ describes 
the shape of the (ellipsoidal) locus of perceptually equidistant patterns from 
$a_0$. The diagonal elements of $W$ represent the contribution of each coefficient to 
the perceived distortion (eq. 2). Non-zero off-diagonal elements induce additional 
contributions to the distortion due to deviations in different dimensions, i.e. they 
represent perceptual interactions between features that modify the perceived 
distortion. This is a convenient way to represent what is commonly referred to 
as masking: a distortion in $a_i$ may mask the subjective distortion in $a_j$. 

In the most general case these two ellipsoids are not aligned, so their eigenaxes, 
and the corresponding PCA-like basis functions, are not the same. 



Measuring the Statistical and Perceptual Relations Among Features. 

The decorrelating efficiency of a feature extraction transform has traditionally 
been judged by the diagonal nature of the resulting covariance matrix. As 
the non-diagonal elements in $W$ represent the 2nd-order perceptual interactions 
between the dimensions of the feature space, here we propose to evaluate the 
transforms from the perceptual point of view applying to $W$ the same measures 
that have been used for $\Gamma$ in the context of transform coding. 

In this way, given a matrix, $M$, that describes the (statistical or perceptual) 
relations between the features, a scalar measure (the statistical interaction, $\eta_s$, 
or the perceptual interaction, $\eta_p$) can be defined comparing the magnitude of 
the off-diagonal coefficients with the magnitude of the diagonal coefficients. 
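
The defining equation of the interaction measure is not legible in this copy. As an illustration only, the sketch below assumes one plausible form, the ratio of the summed absolute off-diagonal entries to the summed absolute diagonal entries; the function name and the example matrices are hypothetical.

```python
def interaction(m):
    """Summed off-diagonal magnitude divided by summed diagonal magnitude."""
    n = len(m)
    off = sum(abs(m[i][j]) for i in range(n) for j in range(n) if i != j)
    diag = sum(abs(m[i][i]) for i in range(n))
    return off / diag

# A diagonal matrix (independent features) and a coupled one.
diagonal = [[2.0, 0.0], [0.0, 1.0]]
coupled = [[2.0, 0.6], [0.6, 1.0]]

print(interaction(diagonal))
print(interaction(coupled))
```

Under this assumed definition the measure is zero exactly when the matrix is diagonal and grows as the features become more strongly coupled, whether the matrix is a covariance or a perceptual metric.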






Aim of the Feature Extraction Transform. In order to minimise the final 
correlations from both statistical and perceptual points of view, the transform 
should find the eigenaxes of both ellipsoids, i.e., it should simultaneously 
diagonalise $\Gamma$ and $W$, or simultaneously minimise $\eta_s$ and $\eta_p$. 
Given a perceptual matrix, $W(a_0)$, simple linear algebra yields the linear 
transform that simultaneously removes both correlations. However, due to the 
highly point-dependent nature of $W$, a different linear transform would be 
necessary for each possible input, which is not a practical solution. 
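
One standard way to build such a transform is sketched below. It is an assumption about the construction (the reference cited for it is not legible in this copy), it requires a positive definite covariance, and the symbols $U$, $\Lambda$, $V$, $D$ and $T_1$ are introduced here only for illustration. For a linear change of features $b = T a$, the covariance becomes $T \Gamma T^T$ and the metric becomes $T^{-T} W T^{-1}$.

```latex
\begin{aligned}
\Gamma &= U \Lambda U^T, \qquad T_1 = \Lambda^{-1/2} U^T
  &&\Rightarrow\quad T_1\, \Gamma\, T_1^T = I \\
W_1 &= T_1^{-T}\, W\, T_1^{-1} = \Lambda^{1/2} U^T W\, U \Lambda^{1/2} = V D V^T
  && (V \text{ orthogonal},\ D \text{ diagonal}) \\
T &= V^T T_1
  &&\Rightarrow\quad T\, \Gamma\, T^T = V^T V = I, \qquad
     T^{-T}\, W\, T^{-1} = V^T W_1 V = D
\end{aligned}
```

Since $V$ and $D$ depend on $W(a_0)$, the matrix $T$ would have to be recomputed or stored for every input, which is the practical drawback of the linear solution.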

In this work we take a different approach: we use the current non-linear 
perceptual model [7,8,9,6] to map a local DCT into a perceptually Euclidean 
space. Beyond the obvious perceptual decorrelation, we show that this non-linear 
transform is also of statistical interest: due to its structure, it also removes the 
residual statistical correlations that remain in the DCT, strongly reducing $\eta_s$. 
In this way, both measures, $\eta_s$ and $\eta_p$, are simultaneously minimised by a single 
adaptive transform which can be inverted without the storage of ad-hoc basis 
functions. 



3 Visual Models and Associated Perceptual Geometry 



Metric and Visual Response. The standard model of human low-level image 
analysis has two basic stages. First the image, $A$, (in the spatial domain) is 
transformed into a vector, $a$, in a local frequency domain (the transform domain) 
using a linear filter bank, $T$. Then a set of mechanisms respond to each coefficient 
of the transformed signal giving an array, $r$ (the response representation): 

$$A \xrightarrow{\;T\;} a \xrightarrow{\;R\;} r \qquad (4)$$



It is well established that the first linear perceptual transform, T, is similar to the 
class of wavelet-like transforms employed in many image analysis applications. 
This is not a coincidence, because the low-level algorithms used by the HVS 
should be mainly determined by the statistics of natural images and, 
as a result, linear PCA-like solutions have been developed in the low-level HVS. 

Not all the basis functions of the transform $T$ are equally perceived so 
additional processing, $R$, is included to explain these non-homogeneities. The HVS 
models assume that all the components of the $r$ vector are equally important and 
there is no perceptual interaction between them (i.e. the response domain 
is Euclidean), so the (perceptual) geometry of the transform domain (and also 
of the spatial domain) must depend on the nature of the response. 

Given a response model, $R$, an explicit expression for the perceptual metric 
in any representation space can be obtained. The change of the elements of a 
tensor under a coordinate mapping depends on the Jacobian of the transform. 
Applying the expressions for tensor transformation to our case, eq. 4, we have 

$$W(a) = \nabla R(a)^T \cdot W'(r) \cdot \nabla R(a) \qquad (5)$$

where $\nabla R$ is the gradient (or Jacobian matrix) of the non-linear response and 
$W' = I$ is the metric in the response domain. 

Given a particular perception model, i.e. a $(T, R)$ pair, eq. 5 can also be used 
to compute the metric, and $\eta_p$, in any other representation domain. 
Non-Linear Energy Normalisation Model. The current models for $R$ assume 
that after the application of the linear filter bank, $T$, the energy of each 
transform coefficient is normalised by a weighted sum of the energy of its neigh- 
bours. The dependence with the neighbour coefficients is given by the convolu- 
tion with an interaction kernel h |iSI6il| . 






$$r_i = \frac{\alpha_i}{100}\,|a_i| + \frac{|a_i|^2}{\beta_i + (h * |a|^2)_i} \qquad (6)$$



Figure 2 shows the parameters of this non-linear energy normalisation model 
and an example of the response for some basis functions of different frequencies. 

The parameters $\alpha_i$, fig. 2a, define a band-pass function that modulates the 
strength of the response for each coefficient $i$. The parameters $\beta_i$, fig. 2b, 
determine the point of maximum slope in each response. The values of $\alpha$ and 
$\beta$ have been fitted to reproduce amplitude discrimination thresholds without 
inter-coefficient masking. A frequency-dependent (one octave width) Gaussian 
kernel, fig. 2c, has been heuristically introduced according to the refs. [7,8,9,6]. 

For mathematical convenience (see section 5) a small linear term (proportional 
to $|a_i|$) has been included in the response model. This linear band-pass 
term (fig. 2a) dominates for very low amplitudes. It is consistent with the fact 
that for low amplitude patterns the HVS response is roughly linear and it is well 
described by a band-pass function, the Contrast Sensitivity Function (CSF). 

In our implementation, the linear transform T is a block DCT, and h includes 
no spatial interactions between neighbour blocks, but the analytical results can 
also be applied to any wavelet-like transform with spatial interactions. 
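
A minimal sketch of this response model may help to see the masking behaviour. It assumes the form $r_i = \frac{\alpha_i}{100}|a_i| + |a_i|^2 / (\beta_i + (h * |a|^2)_i)$, treats the kernel as a small dense matrix of weights, and uses hypothetical parameter values rather than the fitted ones of the paper.

```python
def response(a, alpha, beta, h):
    """Small linear term plus energy divided by a weighted sum of neighbour energies."""
    energy = [abs(x) ** 2 for x in a]
    r = []
    for i in range(len(a)):
        pooled = sum(h[i][j] * energy[j] for j in range(len(a)))
        r.append(alpha[i] / 100.0 * abs(a[i]) + energy[i] / (beta[i] + pooled))
    return r

# Hypothetical parameters for three coefficients (not the values fitted in the paper).
alpha = [4.0, 6.0, 3.0]
beta = [0.02, 0.03, 0.05]
h = [[1.0, 0.5, 0.1],
     [0.5, 1.0, 0.5],
     [0.1, 0.5, 1.0]]

# Response to the middle coefficient alone, and with energetic neighbours present.
alone = response([0.0, 0.2, 0.0], alpha, beta, h)
masked = response([0.3, 0.2, 0.3], alpha, beta, h)
print(alone[1], masked[1])
```

The same amplitude produces a smaller response when its neighbours carry energy, which is the divisive inhibition that the kernel is meant to capture.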



Perceptual Metric using the Non-Linear Normalisation. Taking partial 
derivatives in eq. 6 we have the following gradient matrix: 
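
The printed gradient expression is not legible in this copy. As a hedged reconstruction, assuming the response has the form $r_i = \frac{\alpha_i}{100}|a_i| + |a_i|^2 / (\beta_i + (h * |a|^2)_i)$ with $(h * |a|^2)_i = \sum_k h_{ik} |a_k|^2$, differentiating with respect to the amplitudes $|a_j|$ gives the expression below. It is a derivation under that assumption and not necessarily the authors' printed equation; $\delta_{ij}$ is the Kronecker delta.

```latex
\nabla R(a)_{ij} = \frac{\partial r_i}{\partial |a_j|}
  = \left[ \frac{\alpha_i}{100} + \frac{2\,|a_i|}{\beta_i + (h * |a|^2)_i} \right] \delta_{ij}
  \;-\; \frac{2\,|a_i|^2\, |a_j|\, h_{ij}}{\left( \beta_i + (h * |a|^2)_i \right)^2}
```

The bracket holds two diagonal terms, and the last term, the only one with off-diagonal entries, is never positive when the kernel weights $h_{ij}$ are non-negative.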




Fig. 2. Parameters of the vision model and non-linear response functions. Here, the 
amplitude of the coefficients is expressed in contrast (amplitude over mean luminance) 
which ranges between 0 and 1. The response examples of fig. 2d show the basic 
(sigmoid) behaviour of eq. 6 but they are not general because the response to one 
coefficient depends on the value of the neighbour coefficients. These particular curves were 
computed for the particular case of no additional masking pattern (zero background). 
The slope of the response has three contributions: two diagonal contributions 
and one off-diagonal contribution given by the interaction kernel. Note that 
from medium to high amplitudes the slope decreases with amplitude, i.e. the 
increase in the response is inhibited for high energy coefficients. Also note that 
the off-diagonal contribution is always negative, i.e. the increase in the response 
to one coefficient is also inhibited by high energy neighbour coefficients. 

The non-diagonal contributions in $\nabla R$ give non-diagonal elements in $W$ (see 
fig. 6). It is clear that the relative perceptual relevance of the DCT features 
depends strongly on frequency (the diagonal of $W$ has a low-pass shape), i.e. the 
DCT feature space is perceptually anisotropic. It is also clear that DCT features 
are not perceptually independent because $W$ is not diagonal, i.e. the perceptually 
privileged directions of the DCT feature space are not aligned with the axes of 
the space. This implies that an additional transform is needed to remove the 
perceptual correlation between the DCT features. 

As the metric is input-dependent there are no global privileged directions in 
the space. This implies that the decorrelation transform must be local. 



4 Joint Statistical and Perceptual Decorrelation through 
the Non-linear Normalisation Model 



In this work the non-linear normalisation model is proposed as a feature 
decorrelation mapping from both statistical and perceptual points of view: first, because 
it transforms the DCT domain into a perceptually Euclidean space, and second, 
because its structure makes it a special form of predictive coder, so the 
output, $r$, should show less statistical correlation than the input DCT. 

The basic idea of predictive coding (or DPCM) is to remove from each 
coefficient the part that can be predicted from its neighbours. If a prediction 
of each coefficient is discounted from the original signal in some way, the cross- 
correlation between neighbour coefficients of the result will be highly reduced. In 
the commonly used DPCM the discount is linear: the prediction is subtracted 
from the input giving a decorrelated error signal. 

The normalisation by a weighted sum of the neighbour coefficients can be 
interpreted as a (non-linear) divisive DPCM (see fig. 3): if the central point of 
the kernel is set to zero (i.e. if the coefficient $a_i$ is not taken into account in 
$(h * |a|^2)_i$, as is done in [8]), the convolution in the denominator can be seen as 
a prediction of the energy of each coefficient from the energies of its neighbours. 
The division will be a different way of discounting this prediction from the input. 
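
The decorrelating effect of discounting a prediction can be illustrated with the classical subtractive case. The sketch below uses a hypothetical first-order autoregressive signal and a one-tap predictor with a known coefficient; it does not implement the divisive variant, which replaces the subtraction by a division of each energy by its predicted value.

```python
import random

def lag1_correlation(x):
    """Sample correlation between consecutive elements of x."""
    u, v = x[:-1], x[1:]
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    var_u = sum((p - mu) ** 2 for p in u)
    var_v = sum((q - mv) ** 2 for q in v)
    return cov / (var_u * var_v) ** 0.5

random.seed(1)
rho = 0.95  # hypothetical correlation between neighbouring samples
signal = [random.gauss(0.0, 1.0)]
for _ in range(4999):
    signal.append(rho * signal[-1] + random.gauss(0.0, 1.0))

# Subtractive DPCM: predict each sample from its predecessor and keep the error.
residual = [signal[n] - rho * signal[n - 1] for n in range(1, len(signal))]

print(lag1_correlation(signal), lag1_correlation(residual))
```

The input shows a strong neighbour correlation while the prediction error is close to uncorrelated, which is the property the normalisation is expected to share.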

In fact, the prediction stage in the non-linear normalisation model is similar 
to the prediction scheme that has been successfully used in US! to exploit the 
conditional probabilities of the transform coefficients to encode them in a more 
efficient way. This suggests that the normalisation could certainly remove the 
statistical correlation in $r$. It has been shown that the parameters in eq. 6 can 
be optimised to maximise the decorrelation of the output. However, it is 
important to remark that the parameters used in this work are empirical, i.e. 
not optimised to improve the decorrelation in a given training set. 

5 Quasi-Analytical Inversion of the Normalisation 

Problem Statement. The prediction kernel which makes the model useful for 
statistical decorrelation also makes it non-invertible. As $h$ is not diagonal, each 
response $r_i$ is coupled with every transform coefficient $a_j$. Therefore, the 
inversion of $R$ gives rise to a set of non-linear equations which have no analytical 
solution. There are, of course, a number of numerical methods based on the 
iterative search of a solution, $a$, which minimises some distance, $|r - R(a)|$, but their 
convergence is not guaranteed and may be very sensitive to the initialisation. 

Quasi-analytical Inversion. In spite of the non-invertible nature of $R$, around 
a point $a_a$, the inverse function can be locally written as, 

$$a = R^{-1}(r_a + dr) = a_a + \nabla R^{-1}(r_a) \cdot dr \qquad (8)$$

where the unknown gradient of the inverse function can be related to the 
(known) gradient of the response (see fig. 4). This differential equation represents 
the local evolution of the inverse response. If it is integrable, it can be used to 
propagate the solution from any initial conditions $(r_a, a_a)$, up to the desired 
point $r_b$. The computation of the inverse can be analytically formulated as a 
definite integral. As this integral must be numerically solved we have called this 
method quasi-analytical in contrast to the numerical search-based methods. 

Convergence of the Solution. The existence and uniqueness of the solution 
of an initial value problem is guaranteed if the gradient to be integrated is 
bounded. In our case, $\nabla R(a)$ should not vanish anywhere. The small linear term 



Fig. 3. Alternative DPCM schemes. Fig. 3a shows the classical subtractive DPCM. 




Fig. 4. Inverse computation integrating the increments of the inverse function. In each 
iteration, the unknown gradient is computed from the known response at that point. 
Fig. 5. Reconstruction errors of a typical block (solid line) and a difficult block (dashed 
line). The curves are the average over several initial conditions: the mean DCT, a 1/f 
DCT and a flat DCT. The bars show the dispersion in the distortion due to the different 
initial conditions. The differences below the dashdot line are visually negligible. 



in the response ensures a non-zero slope for every $a$. This guarantees that 
the integration of eq. 8 is possible and always gives the appropriate solution. 
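
The integration described above can be sketched for a toy case. The example assumes two coefficients with non-negative amplitudes, a response of the form $r_i = c_i a_i + a_i^2 / (\beta_i + \sum_j h_{ij} a_j^2)$ with hypothetical parameters, a straight path in the response domain from the initial response to the target, and a classical fourth-order Runge-Kutta scheme; all function names are illustrative.

```python
def response(a, c, beta, h):
    """r_i = c_i * a_i + a_i^2 / (beta_i + sum_j h_ij * a_j^2), amplitudes a_i >= 0."""
    out = []
    for i in range(2):
        d = beta[i] + sum(h[i][j] * a[j] ** 2 for j in range(2))
        out.append(c[i] * a[i] + a[i] ** 2 / d)
    return out

def jacobian(a, c, beta, h):
    """Analytic gradient matrix dr_i/da_j of the response above."""
    jac = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        d = beta[i] + sum(h[i][j] * a[j] ** 2 for j in range(2))
        for j in range(2):
            jac[i][j] = -2.0 * a[i] ** 2 * a[j] * h[i][j] / d ** 2
        jac[i][i] += c[i] + 2.0 * a[i] / d
    return jac

def inverse_gradient_times(a, dr, c, beta, h):
    """Solve jacobian(a) * da = dr for da using the explicit 2x2 inverse."""
    (p, q), (s, t) = jacobian(a, c, beta, h)
    det = p * t - q * s
    return [(t * dr[0] - q * dr[1]) / det, (-s * dr[0] + p * dr[1]) / det]

def invert(r_target, a_start, c, beta, h, steps=40):
    """Integrate da/dt = jacobian(a)^-1 (r_target - r_start) over t in [0, 1] with RK4."""
    r_start = response(a_start, c, beta, h)
    dr = [r_target[k] - r_start[k] for k in range(2)]
    f = lambda a: inverse_gradient_times(a, dr, c, beta, h)
    a, step = list(a_start), 1.0 / steps
    for _ in range(steps):
        k1 = f(a)
        k2 = f([a[k] + 0.5 * step * k1[k] for k in range(2)])
        k3 = f([a[k] + 0.5 * step * k2[k] for k in range(2)])
        k4 = f([a[k] + step * k3[k] for k in range(2)])
        a = [a[k] + step * (k1[k] + 2 * k2[k] + 2 * k3[k] + k4[k]) / 6.0 for k in range(2)]
    return a

# Hypothetical parameters; the linear weights c keep the slope away from zero.
c = [0.05, 0.05]
beta = [1.0, 1.0]
h = [[0.5, 0.2], [0.2, 0.5]]

a_true = [0.6, 0.3]
a_rec = invert(response(a_true, c, beta, h), [0.1, 0.1], c, beta, h)
error = max(abs(a_rec[k] - a_true[k]) for k in range(2))
print(a_rec, error)
```

At each step the unknown gradient of the inverse is obtained by inverting the known gradient of the response at the current point, so no iterative search over candidate solutions is needed.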

To test the speed and robustness of the inverse computation, the 16 x 16 
DCT blocks of a set of 256 x 256 natural images were transformed according to 
the non-linear normalisation and then inverted back integrating eq. 8 with a 
4th-order Runge-Kutta algorithm. The effect of the initialisation and the number 
of integration steps was explored. Figure 5 shows the DCT reconstruction error 
as a function of the number of integration steps for two different blocks and 
different initial conditions. The inversion experiments show the following trends: 

— The solution is always found. The experiments confirm the theoretical 
existence and uniqueness result: the proposed method achieves the appro- 
priate inverse (with negligible distortion), for every response block, no matter 
the initial conditions with a reasonably small number of integration steps. 

— Speed. Most of the responses (~90% in the explored images) appropriately 
converge to their corresponding DCT in 3-6 integration steps from very 
different initial conditions. The solid line in fig. 5 is an example of this kind 
of block. However, we found that ~10% of the DCT blocks, usually 
corresponding to sharp spectrum regions, require a more accurate integration. 
The dashed line represents the worst-behaved block of the training set. 

— Robustness. The inverse does not substantially depend on the initial 
conditions, but on the nature of the block (see fig. 5) so the algorithm is insensitive 
to the initialisation. Generic 1/f or flat spectra give quite good results. 

6 Decorrelation Experiments 

The decorrelation properties of the proposed representation were compared with 
the standard PCA representation (i.e. the domain of eigenfunctions of $\Gamma$) and 
with the domain of eigenfunctions of $W$, which will be referred to as Perceptual 
Principal Component Analysis (PPCA). The local DCT, which is the best 
fixed-basis approximation to PCA analysis for natural images, was also explored. 
The local DCT is also interesting because it is the first linear stage, $T$, in the 
Fig. 6. Covariance (upper row) and perceptual metric (lower row) in different domains. 
The qualitative meaning of the elements of these matrices depends on how the 2D 
domains are scanned to construct the 1D feature vectors. The matrices in the spatial 
domain are the result of a raster scanning. A JPEG-like zigzag scanning has been used 
in the DCT and the other transform domains because the coefficients of similar 
frequency are grouped together. According to this, the frequency meaning of the diagonal 
elements of $W$ and $\Gamma$ in these domains progressively increases from zero to the Nyquist 
frequency. For the sake of clarity only the upper-left 176 x 176 submatrix is shown. The 
frequency values of the displayed elements in the DCT domain range from 0 to 26 cpd. 



proposed representation, $(T, R)$, so it is useful to assess the benefits of the 
non-linear normalisation $R$. The spatial representation has been included as a useful 
example of a highly correlated domain. 

The PCA representation was computed from the covariance around the aver- 
age of a set of natural images. The PPCA was computed from the average per- 
ceptual metric, originally defined over the DCT blocks of the training set. The 
values of T, W, rjs and rjp, in the different domains are shown in figure 0 

The highly non-diagonal nature of W in the spatial domain is an additional 
argument against the spatial domain representation that complements the clas- 
sical reasonings exclusively based on the non-diagonal nature of the covariance 
matrix |1 12] . The DCT domain certainly reduces the statistical and perceptual 
interactions by an order of magnitude with regard to the spatial domain but it 
still does not completely remove either of them. The linear approaches that only
take into account one of the relations, PCA or PPCA, are not acceptable because, 
in these cases, the other relation is increased in the resulting representation. 

The proposed representation, DCT plus non-linear normalisation transform, 
gives the best results. On one hand, it achieves a complete perceptual decor- 
relation for every input because it works with local (not average) metrics. In 
this sense the perceptual decorrelation is better than in the PPCA or any other 
PCA-like approach such as m- On the other hand, the statistical interaction is 
also highly reduced, almost an order of magnitude with regard to the DCT. 

7 Concluding Remarks

In this paper the perceptual correlation between the features of an image repre- 
sentation has been formalised through the perceptual metric matrix in the same 
way as the statistical correlation is described by the covariance matrix. 

We have presented a perceptually inspired image representation that simulta- 
neously reduces the statistical and perceptual correlation between the features. 
It first applies a linear local frequency transform and then a non-linear energy
normalisation to the coefficients. The good statistical behaviour of this
perceptual model relies on its divisive-DPCM structure. The proposed repre- 
sentation improves the decorrelation properties of a fixed basis representation 
such as the DCT without the basis storage problem of linear input-dependent 
PCA-like transforms because an efficient method to invert it has been presented. 

According to the results presented here, the non-linear mapping R may be 
a very interesting second stage after the linear DCT or wavelet-like transforms 
used in many image analysis applications HZ). 

Abstract. Feature selection aims to find the most important feature
subset from a given feature set without degradation of discriminative
information. In general, we wish to select a feature subset that is effective
for any kind of classifier. Such studies are called Classifier-Independent
Feature Selection, and Novovicova et al.’s method is one of them. Their
method estimates the densities of classes with Gaussian mixture models,
and selects a feature subset using Kullback-Leibler divergence between
the estimated densities, but there is no indication of how to choose the
number of features to be selected. Kudo and Sklansky (1997) suggested the
selection of a minimal feature subset such that the degree of degradation
of performance is guaranteed. In this study, based on their suggestion,
we try to find a feature subset that is minimal while maintaining a given
Kullback-Leibler divergence.



1 Introduction 

The goal of feature selection is often said to be to find a subset of a given size from 
a given feature set such that the subset has the most discriminative information. 
However, the goal of feature selection has recently changed to finding the subset that is
most effective for classifiers, without the size of the subset being given. In large-scale
problems (over 50 features), there seem to be many garbage features, which have a 
bad influence on the construction of classifiers. It is, therefore, expected that the 
performance of classifiers can be improved by removing such garbage features. 

Many methods have been proposed for feature selection, such as the Sequential
Forward/Backward Floating Search (SFFS/SBFS) method. These
methods select a feature subset that maximizes a criterion function based on 
the recognition rate of a classifier chosen beforehand. Such an approach is useful 
when we know what kind of classifiers will be used. It is, however, more desirable 
to select a feature subset that is universally effective for any classifier. Such an 
approach is called Classifier-Independent Feature Selection iE] and Novovicova 
et al.’s method jHI is one such method. 

In Novovicova et al.’s method, class-conditional densities are estimated from
given data with Gaussian mixture models, and a feature subset of a given size 
that maximizes Kullback-Leibler (K-L) divergence 0 between the densities is 
selected. However, from the viewpoint of classifier-independent feature selection,
it is desirable to find a feature subset that is as small as possible but includes all 
information necessary for classification. Therefore, we think it is more important 
to select a feature subset on the basis of performance. Such a trial has been 
carried out by Kudo and Sklansky iHnini. 

In this study, we use K-L divergence to evaluate the performance of a feature
subset in Novovicova et al.’s method. That is, K-L divergence plays a double role:
it ranks the features and it evaluates how useful a chosen subset is.
Some experiments were carried out to confirm the effectiveness of the proposed 
method. 



2 Novovicova et al.’s Method



In Novovicova et al.’s method, a class-conditional density is estimated from
a training sample set of the class using the following Gaussian mixture model:

$$p(x \mid \omega) = \sum_{m=1}^{M} \alpha_m^{\omega} \prod_{i=1}^{D} \left[ f(x_i \mid b_{mi}^{\omega}) \right]^{\phi_i} \left[ f_0(x_i \mid b_{0i}) \right]^{1-\phi_i} \qquad (1)$$

where M is the number of components and $\alpha_m^{\omega}$ are the weights of the
components, satisfying $\sum_{m=1}^{M} \alpha_m^{\omega} = 1$. Also, $b_{mi}^{\omega}$ and $b_{0i}$
are the parameters specifying the components, $\Phi = (\phi_1, \ldots, \phi_D)$ is the
parameter that indicates a feature subset, and D is the number of given features.
The function $f$ is a Gaussian specified by parameter $b_{mi}^{\omega}$, and $f_0$ is a
background Gaussian distribution specified by $b_{0i}$. The parameters
$\alpha_m^{\omega}$, $b_{mi}^{\omega}$ and $b_{0i}$ are estimated by the EM algorithm as maximum
likelihood estimators. The vector $\Phi$ indicates which features are used and which
features are ignored: $\phi_i = 1$ for feature i to be used and $\phi_i = 0$ for feature i
to be the background. The key is that the density form of each component in (1) is
independent with respect to features, so that we can evaluate individual features
independently. The dependency among features is absorbed in the mixture. To measure
how far two densities are from each other, we use K-L divergence:



$$J(\Phi) = \sum_{\omega \in \Omega} P(\omega)\, E_{\omega} \left\{ \log \frac{p(x \mid \omega)}{p(x \mid \Omega - \omega)} \right\}$$

Here, $\Omega$ denotes the set of classes and $P(\omega)$ denotes the a priori probability
of class $\omega$. $E_{\omega}$ is the expectation over x from class $\omega$, and $p(x \mid \omega)$
is the class-conditional probability density function defined by (1). K-L divergence
measures the separability between two different distributions. If K-L divergence in a
feature subset is 0, the feature subset has no discriminative information. In Novovicova
et al.’s method, K-L divergence is given as the sum of that of each feature. They
rank each feature in order of its K-L divergence, and select the feature subset of
a given size by removing some features with a small K-L divergence.
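
As an illustration only, the mixture model (1) can be evaluated as in the following minimal Python sketch. The function and argument names (`class_density`, `weights`, `comps`, `background`, `phi`) are hypothetical, and each Gaussian is assumed to be parameterised by a (mean, variance) pair; the original parameterisation of $b_{mi}^{\omega}$ and $b_{0i}$ is not given in the text.

```python
import math


def gauss(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def class_density(x, weights, comps, background, phi):
    """Evaluate the mixture of model (1) at the feature vector x.

    weights[m]      mixture weight of component m (weights sum to one)
    comps[m][i]     assumed (mean, var) of the Gaussian f for feature i in component m
    background[i]   assumed (mean, var) of the background Gaussian f0 for feature i
    phi[i]          1 if feature i is used, 0 if it is treated as background
    """
    total = 0.0
    for alpha, comp in zip(weights, comps):
        prod = 1.0
        for xi, (mu, var), (mu0, var0), used in zip(x, comp, background, phi):
            # A used feature contributes f; an ignored one contributes f0.
            prod *= gauss(xi, mu, var) if used else gauss(xi, mu0, var0)
        total += alpha * prod
    return total
```

When every $\phi_i$ is zero, the density reduces to the product of the background Gaussians and no longer depends on the class, which is why such a subset carries no discriminative information.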

3 Proposed Method

3.1 How to Select the Number of Features 

K-L divergence is monotonically increasing with respect to the number of features.
When all features are used, K-L divergence takes its maximum value. As seen
in [10], the most important characteristic of evaluation functions for
classifier-independent feature selection is the monotonicity of the functions. Therefore, we
use K-L divergence as our evaluation function of a feature subset. Features that
contribute only a little to increasing K-L divergence are thought to be garbage
features. Thus, a feature subset is chosen by the following two steps. First,
features are sorted in order of the values of K-L divergence. Second, the smallest
number of features $D_\alpha$ that attains $\alpha$-degradation of the K-L divergence of
the full feature set is selected (Fig. 1) [9].
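
The two steps can be sketched in Python as follows. This is a minimal sketch under two assumptions that are not stated explicitly in the text: that the divergence of a subset is the sum of per-feature divergences (as in Section 2), and that "$\alpha$-degradation" means the subset retains at least $(1-\alpha)$ of the maximum divergence, with $\alpha$ given as a fraction (0.01 for 1%). The name `select_num_features` is hypothetical.

```python
def select_num_features(per_feature_kl, alpha):
    """Return (D_alpha, indices of the selected features).

    per_feature_kl[i] is the K-L divergence contributed by feature i.
    alpha is the allowed degradation as a fraction of the maximum divergence.
    """
    # Step 1: rank features by decreasing divergence.
    order = sorted(range(len(per_feature_kl)),
                   key=lambda i: per_feature_kl[i], reverse=True)
    j_max = sum(per_feature_kl)
    target = (1.0 - alpha) * j_max
    # Step 2: take the smallest prefix whose divergence reaches the target.
    cumulative = 0.0
    for d, i in enumerate(order, start=1):
        cumulative += per_feature_kl[i]
        if cumulative >= target:
            return d, order[:d]
    return len(order), order
```

A larger value of `alpha` lowers the target and therefore never increases the number of selected features.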



[Figure: K-L divergence (vertical axis) plotted against the number of features
(horizontal axis). Legend: J_max, maximum K-L divergence; α, threshold (%);
D, maximum number of features; D_α, selected number of features.]

Fig. 1. Method for determining the number of features



3.2 Fake K-L Divergence 

If the number of training samples is sufficiently large, the estimated distribution
is expected to be close to the true distribution. Then, the estimated K-L
divergence is also reliable. However, in practice, because of a limited number of
training samples, the calculated K-L divergence can increase when a feature
without discriminative information is added. We call this increase fake K-L
divergence. Indeed, in an example using a uniform distribution, fake K-L divergence
is observed (Fig. 2). In this example, two classes share the same uniform
distribution in $[0, 1]^d$; thus, the true K-L divergence is zero for all d. A Gaussian
is used for the estimation of the uniform distribution. From Fig. 2, we see that the
fake K-L divergence is larger for a smaller number of samples and that it is
proportional to the dimensionality d. The fake K-L divergence is approximated by
3.07d/N, where d is the number of features and N is the number of training samples.
It is possible to use this score (3.07/N) as the amount of fake K-L divergence due to
a feature without discriminative information when N samples are given. However,
since this sometimes overestimates the true fake divergence, we use the following
estimation. When D is the number of original features, we use the difference between
the K-L divergence of $\Phi_{D-1}$ and that of $\Phi_D$, where $\Phi_d$ denotes the subset
of the d top-ranked features. This is because, after sorting of features in order
of K-L divergence, the last (Dth) feature is expected to have no discriminative
information. Thus, we can regard this difference as the increase due to the fake
K-L divergence when one garbage feature is added. Then the influence of fake
K-L divergence of d garbage features is estimated by

$$\tilde{J}(\Phi_d) = \{ J(\Phi_D) - J(\Phi_{D-1}) \} \times d, \qquad d = 1, 2, \ldots, D. \qquad (2)$$

Accordingly, we have a more accurate estimation of $J(\Phi_d)$ by subtracting the
fake K-L divergence $\tilde{J}(\Phi_d)$ from $J(\Phi_d)$, where $\tilde{J}(\Phi_d)$ depends on
the sample size N.
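
The correction of equation (2) can be sketched as below. The function name `corrected_divergence` and the list layout are hypothetical; the input is assumed to hold $J(\Phi_d)$ for d = 1, ..., D after the features have been sorted by decreasing divergence.

```python
def corrected_divergence(cumulative_kl):
    """Subtract the estimated fake K-L divergence of equation (2).

    cumulative_kl[d - 1] holds J(Phi_d) for d = 1..D (features already ranked).
    Returns the corrected values J(Phi_d) - d * (J(Phi_D) - J(Phi_{D-1})).
    """
    if len(cumulative_kl) < 2:
        raise ValueError("at least two features are needed")
    # Increase attributed to one garbage feature: the gain of the last-ranked feature.
    per_garbage = cumulative_kl[-1] - cumulative_kl[-2]
    return [j - per_garbage * d
            for d, j in enumerate(cumulative_kl, start=1)]
```

By construction, the corrected values for d = D - 1 and d = D coincide, so the corrected curve is flat at its right end.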




Fig. 2. Fake K-L divergence using a uniform distribution. N is the number of samples. 
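
The effect can be reproduced qualitatively with a small simulation. The sketch below is not the authors' experiment: it assumes one Gaussian per feature fitted by maximum likelihood, independent features, equal class priors, and `n_per_class` samples per class, and it makes no attempt to recover the constant 3.07. All names are hypothetical.

```python
import math
import random


def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a one-dimensional sample."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def kl_gauss(p, q):
    """K-L divergence KL(p || q) between two univariate Gaussians (mean, var)."""
    (m1, v1), (m2, v2) = p, q
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def fake_divergence(d, n_per_class, trials=200, seed=0):
    """Average estimated divergence between two classes that share U[0, 1]^d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        for _ in range(d):  # independent features: divergences add up
            a = fit_gaussian([rng.random() for _ in range(n_per_class)])
            b = fit_gaussian([rng.random() for _ in range(n_per_class)])
            total += 0.5 * kl_gauss(a, b) + 0.5 * kl_gauss(b, a)
    return total / trials
```

Although the true divergence is zero, the average returned by `fake_divergence` is positive; under these assumptions it shrinks as `n_per_class` grows and grows with `d`, in line with the behaviour described for Fig. 2.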



3.3 Number of Components 

In Novovicova et al.’s method, there is also no indication of how to decide the
number of components M in (1). We determine the number of components by
the Minimum Description Length (MDL) principle. Varying the number of
components in the proper range, we adopt the number of components M for
each class that minimizes the following MDL value:

$$\mathrm{MDL} = -L + \frac{1}{2} m \log N, \qquad m = M(1 + 2D), \qquad L = \sum_{x \in X_{\omega}} \log p(x \mid \omega). \qquad (3)$$

Here, M is the number of components, D is the number of features, N is the
number of samples, and $X_{\omega}$ denotes the set of training samples of class $\omega$. The
number of free parameters m is derived from equation (1). To avoid dependency on
the initial parameters, we carried out this experiment 20 times with
different initial parameters and selected the number of components that
minimized the MDL value the most times.
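
A minimal sketch of this selection rule is given below, assuming that the log-likelihood L of each fitted mixture is already available (the EM fitting itself is not shown). The names `mdl_value` and `choose_components`, and the dictionary layout of `runs`, are hypothetical.

```python
import math
from collections import Counter


def mdl_value(log_likelihood, n_components, n_features, n_samples):
    """MDL value of equation (3): -L + (m / 2) log N with m = M (1 + 2 D)."""
    m = n_components * (1 + 2 * n_features)
    return -log_likelihood + 0.5 * m * math.log(n_samples)


def choose_components(runs, n_features, n_samples):
    """Pick the number of components that wins most often over repeated runs.

    runs is a list with one entry per random initialisation; each entry maps a
    candidate number of components M to the log-likelihood obtained with it.
    """
    winners = [
        min(run, key=lambda M: mdl_value(run[M], M, n_features, n_samples))
        for run in runs
    ]
    return Counter(winners).most_common(1)[0][0]
```

For example, with D = 2 and N = 50 the penalty for each additional component is about 9.78, so a second component is adopted only when it raises the log-likelihood by more than that amount.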

4 Experiments 

We dealt with three real data sets. The threshold α is taken as 1.0% or 0.5%,
and the number of components M is determined by the MDL criterion. If all
the garbage features are removed properly, the recognition rate is expected to be
improved for any kind of classifier. The four classifiers used to evaluate the
goodness of the selected feature subset were the Bayes linear classifier, the Bayes
quadratic classifier, the C4.5 decision tree classifier, and the one-nearest
neighbor (1-NN) classifier.

1. Mammogram: A mammogram database |2j. 

The database is a collection of 86 mammograms from 74 cases. The 65
features consist of 18 features characterizing calcification (number, shape,
size, etc.) and 47 texture features (histogram statistics, Gabor wavelet 
response, edge intensity, etc.). There are two classes, one of benign and 
one of malignant tumors (57 and 29 samples, respectively). 

2. Sonar: A sonar database m- 

The task is to discriminate between sonar signals bounced off a metal 
cylinder and those bounced off a roughly cylindrical rock using 60 fea- 
tures, each of which describes the energy within a particular frequency 
band, integrated over a certain period of time. The database consists of 
111 patterns obtained by bouncing sonar signals off a metal cylinder at 
various angles and under various conditions and 97 patterns obtained 
from rocks under similar conditions. 

3. Wdbc: A Wisconsin breast cancer database m 

Thirty features (radius, texture, perimeter, area, etc.) were computed 
from a digitized image of a fine needle aspirate (FNA) of a breast mass. 
They describe characteristics of the cell nuclei present in the image. 
There are two classes, one of benign and one of malignant tumors (357 
and 212 samples, respectively). 

The selected number of components was one for every data set. For the mammogram
data, the K-L divergence and the estimated fake K-L divergence are shown
in Fig. 3(a), and the corrected K-L divergence (K-L divergence minus fake K-L
divergence) is shown in Fig. 3(b). With the correction of K-L divergence, the
selected number of features became smaller than before the correction. In
the following, the corrected K-L divergence is used in all experiments. The
values of K-L divergence and recognition rates for every number d (d = 1, 2, ..., D)
are shown in Fig. 4. Recognition rates, as well as results by another
classifier-independent feature selector, SUB [3], are shown in Table 1. Here, the recognition
rates are calculated by applying the leave-one-out technique to the data.





(a) K-L divergence and fake K-L divergence 



(b) Corrected K-L divergence 



Fig. 3. K-L divergence ( — ) and fake K-L divergence (- -) for mammogram data. In 
Figure (a), the K-L divergence and the estimated fake K-L divergence are arranged to 
share the same point at the right end. 



In the mammogram data, the selected feature subset succeeded in improving or at
least maintaining the performance of classifiers except for the quadratic classifier.
This exception arises because the covariance matrix used in the quadratic classifier
became singular, the number of samples being smaller than the dimensionality
plus one. In the sonar data, the recognition rates of the classifiers except for the
linear classifier were improved. The densities of the two classes in the sonar
data share almost the same mean vector, so the linear classifier did not work
well. In the wdbc data, the classification rates of all classifiers were improved, or at
least maintained, compared with the case where all features were used. In total,
the fundamental effectiveness of the proposed method as a classifier-independent
algorithm was confirmed. In these experiments, the value of threshold α was
set to a small constant of 1% or 0.5%. The larger the value of α, the smaller
the number of features selected. This flexibility is left to the user.



5 Discussion 

The effectiveness of our approach depends on two factors. One factor is how 
well the mixture model approximates the densities, and another factor is how 
well K-L divergence expresses the true performance of a feature subset. In our 
experiments, the divergence curve is sufficiently smooth and becomes flat as
the number of features becomes larger. This allows us to choose a fairly small value
of α. Ideally, the value of α should be almost zero. If the divergence reflects
exactly the true performance of the feature subset, the flatness shows that there
are actually some garbage features.

Fig. 4. Results of selection of the number of features and the recognition rate.

It is very difficult to determine which feature subset is best in the sense of 
classifier-independent feature selection. Of course, if we have the Bayes classifier, 
this task can be performed easily. However, in that case every feature contributes 
to a certain degree to the classification, even if the degree is small. Then, the 
problem is reduced to finding a compromise between the performance and the 
cost for measurement of the features. 

One method for determining whether or not the selected subset is really 
effective is to confirm whether the performance has been improved in as many 
classifiers as possible. Another method is to focus on the performance of the best 
classifier. For example, the quadratic classifier has a high recognition rate at size 
28 in the mammogram data. This suggests that the feature subset for classifier- 
independent feature selection must be a subset larger than 28. In this regard, 
our feature subset is preferable to that of SUB. For practical use, the user can
also select a larger value of α to balance the measurement cost against
performance.



Table 1. Recognition rate [%] obtained by the leave-one-out technique. Values in
parentheses are the numbers of selected features. An up-arrow (↑) means that the
recognition rate was improved compared with the case in which all features were used,
and a down-arrow (↓) means that the recognition rate was degraded.

(a) mammogram data

Classifier    Proposed method         SUB      ALL
              α=1.0%     α=0.5%
              (36)       (42)         (10)     (65)
Linear        82.6↑      80.2↑        89.5     65.1
Quadratic     33.7↓      33.7↓        88.4     66.3
1-NN          70.9↑      69.8↑        77.9     66.3
C4.5          76.7       76.7         76.7     76.7

(b) sonar data

Classifier    Proposed method         SUB      ALL
              α=1.0%     α=0.5%
              (47)       (49)         (35)     (60)
Linear        73.6↓      74.5↓        76.0     75.0
Quadratic     77.9↑      79.3↑        82.2     75.5
1-NN          86.5↑      83.7↑        84.6     82.7
C4.5          76.4↑      76.4↑        68.8     75.5

(c) wdbc data

Classifier    Proposed method         SUB      ALL
              α=1.0%     α=0.5%
              (22)       (24)         (24)     (30)
Linear        96.0↓      96.0↓        97.4     96.1
Quadratic     95.8↑      95.8↑        96.0     95.6
1-NN          91.6       91.6         90.5     91.6
C4.5          95.3↑      95.3↑        94.6     94.9

6 Conclusion

We proposed a method for finding a feature subset from the viewpoint of
classifier-independent feature selection. The fundamental effectiveness of the proposed
method was confirmed by the results of experiments conducted on three real
data sets. Further examination of the effectiveness of the proposed method using
more classifiers is needed.

Abstract. The performance and speed of three classifier-specific feature
selection algorithms, the sequential forward (backward) floating search
(SFFS (SBFS)) algorithm, the ASFFS (ASBFS) algorithm (its adaptive
version), and the genetic algorithm (GA), are compared for large-scale
problems. The experimental results showed that 1) ASFFS (ASBFS) has
better performance than does SFFS (SBFS) but requires much more
computation time, 2) much training in GA with a larger number of generations
or with a larger population size, or both, is effective, 3) the performance
of SFFS (SBFS) is comparable to that of GA with less training, and the
performance of ASFFS (ASBFS) is comparable to that of GA with much
training, but in terms of speed GA is better than ASFFS (ASBFS) for
large-scale problems.



1 Introduction 



Many algorithms for feature selection have been proposed jf f2l3l4lhyff7j . These 
algorithms are divided into two categories: in one group jf p2f4|5] the goodness 
of a feature subset is measured by the value of a given criterion function, and 
in the other group the goodness is measured by their own criteria. Algorithms
belonging to the former group are called “classifier-specific feature selection
algorithms” because their criterion functions are usually based on a correct
recognition rate by a certain classifier. This is useful when we know in advance 
which classifier will be used after feature selection. Algorithms of the latter group 
are called “classifier-independent feature selection algorithms” because their cri- 
terion functions are based on an approximation of class-conditional probability 
density functions. This means that any classifier-independent feature selection 
algorithm should not depend on particular classifiers to be used but depend on 
the ideal “Bayes classifier.” This approach is useful when we do not know which
classifiers will be used or when we wish to avoid over-fitting of the criterion
function depending on a specific classifier.

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 677-^^^ 2000.
© Springer-Verlag Berlin Heidelberg 2000

In this study, we compare the three promising algorithms for classifier-specific 
feature selection. One is the genetic algorithm, hereafter called GA, which is an 
effective algorithm for finding sub-optimal solutions in optimization problems. 
GAs have been shown to be effective for feature selection. The second algorithm is 
the sequential forward (backward) floating search method, hereafter called SFFS 
(SBFS), which is a generalized sequential search method and has also been widely
used for feature selection. The third algorithm is a recently developed adaptive 
version of the SFFS (SBFS), hereafter called ASFFS (ASBFS), in which the 
local greedy search is replaced by a combinational search. The aim of this study 
is to examine how these algorithms work in large-scale practical problems in 
which the number of features is more than 50. 

2 Comparative Studies 

There have been some comparative studies of classifier-specific feature selection 
algorithms IMI . 

Ferri et al. |2| concluded that GA and SFFS are comparable in performance, 
but as the dimensionality increases the result of GA becomes worse than that of 
SFFS. Jain and Zongker compared GA with SFFS for Kittler’s artificial data 
and reported a tendency of premature convergence when GA was used. On the 
other hand, Kudo and Sklansky |S| reported that for small-scale and medium-
scale problems SFFS or SBFS is better than GA, and GA is better than SFFS 
or SBFS for large-scale problems. In addition, as the dimensionality increases, 
GA becomes faster than SFFS and SBFS. They concluded that GA is better 
than SFFS and SBFS in the following two points: 1) GA is controllable in terms 
of execution time because it can be terminated whenever we want by limiting 
the number of generations, and 2) the results of GA can be improved with more 
trials and with more training. In this paper, we present a comparison of the
performance and speed of a sophisticated version of SFFS (SBFS) and GA with 
more training. 

3 Algorithms 

Here, we find a feature subset X_d of size d (< D) from the original feature set Y
of size D.

3.1 SFFS (SBFS) Algorithm

Among many reported sequential search algorithms, the best in terms of compro- 
mise between speed and performance are the SFFS (Sequential Forward Floating 
Search) method and the SBFS method (its backward version), proposed in 1994 
by Pudil et al. The number of forward (adding) / backward (removing) 
steps is determined dynamically during the method’s run so as to maximize the 
criterion function (Fig. 1).

Fig. 1. Simplified flow chart of SFFS algorithm
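
The floating behaviour can be illustrated with the following Python sketch of a forward floating search. It is a simplified reading of the SFFS idea rather than a transcription of the flow chart: after each forward step, features are conditionally removed as long as the reduced subset beats the best subset of that size found so far. The function name `sffs` and the convention that `criterion` takes a list of feature indices are assumptions of this sketch.

```python
def sffs(n_features, d, criterion):
    """Return the best subset of size d found by a forward floating search.

    criterion(subset) takes a list of feature indices and returns a score
    to be maximised (e.g. a recognition rate).
    """
    best = {}      # subset size -> (score, subset) of the best subset found so far
    current = []
    while len(current) < d:
        # Forward step (SFS): add the feature that gives the highest score.
        candidates = [f for f in range(n_features) if f not in current]
        added = max(candidates, key=lambda f: criterion(current + [f]))
        current = current + [added]
        value = criterion(current)
        if len(current) not in best or value > best[len(current)][0]:
            best[len(current)] = (value, list(current))
        # Conditional backward steps: remove the least significant feature
        # while that improves on the best known subset of the smaller size.
        while len(current) > 2:
            removable = max(
                current,
                key=lambda f: criterion([g for g in current if g != f]))
            reduced = [g for g in current if g != removable]
            reduced_value = criterion(reduced)
            if reduced_value > best[len(reduced)][0]:
                current = reduced
                best[len(reduced)] = (reduced_value, list(reduced))
            else:
                break
    return best[d][1]
```

Because a backward step is accepted only when it strictly improves the best score of its size, the number of backward steps is finite and the search terminates.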




Fig. 2. The meaning of the user parameters r_max and b for adjusting the adaptive
generalization



3.2 ASFFS (ASBFS) Algorithm UHl 

The ASFFS (Adaptive Sequential Forward Floating Search) and ASBFS (its 
backward version ) are sophisticated versions of SFFS and SBFS, respectively. 

In addition to the floating search in backtrack, ASFFS enables a more
sophisticated search in the local forward search (SFS step in Fig. 1) in SFFS; that
is, the method uses a combinational search of o features, GSFS(o), instead of
SFS (Sequential Forward Search). Here, the parameter o is upper-bounded by r,
and r is upper-bounded by r_max, which is arranged to be large near the desired
number of features d and small far from d with an additional parameter b
(Fig. 2). The simplified flowchart of the ASFFS algorithm is shown in Fig. 3. The
terminating condition k = d + Δ in the flowchart means that, in order to fully
utilize the potential of the search, we should not stop the algorithm immediately
after it reaches the dimensionality d for the first time. By leaving it to float up
and back a bit further, the potential of the algorithm is better utilized and a
subset of size d outperforming the first one is usually found. In practice, we can
let the algorithm either go up to the original dimensionality D, or if D is too
large, then the value of Δ can be determined heuristically (e.g., according to the
value of the maximum number of backtracking steps prior to reaching d for the
first time). ASBFS is initialized in the same way as ASFFS, except that k = D
and X_D = Y. The ASBFS (Adaptive Sequential Backward Floating Search) is
the “top-down” counterpart to the ASFFS procedure.
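
How the bound r might depend on the current subset size k can be sketched as follows. The first and last branches follow the description above (r equals r_max within distance b of d, and 1 far from d); the linear ramp in the middle branch is an assumption of this sketch, chosen so that the three branches join continuously. The function name `adaptive_r` is hypothetical.

```python
def adaptive_r(k, d, r_max, b):
    """Upper bound r on the number of features searched jointly at size k."""
    distance = abs(k - d)
    if distance < b:
        return r_max                      # close to the desired size d
    if distance < r_max + b:
        return r_max + b - distance       # assumed linear ramp down to 1
    return 1                              # far from d: plain sequential step
```

With r_max = 3, b = 3 and d = 10, this gives r = 3 for k between 8 and 13, r = 2 for k = 14, and r = 1 from k = 15 upwards.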
[Flowchart. Box labels of the forward phase: let k = 0; adaptively recalculate r
(r = r_max if |k - d| < b, r = 1 if |k - d| is at least r_max + b, an intermediate
value otherwise); let o = 1; conditionally include o features found by applying one
step of the GSFS(o) algorithm; take out the conditionally included features; leave in
the conditionally included features; let o = o + 1; forget the current subset and take
the so-far best subset of size k + 1; let k = k + o; let k = k + 1.
Box labels of the backward phase: adaptively recalculate r; let o = 1; conditionally
exclude o features found by applying one step of the GSBS(o) algorithm; let k = k - o;
leave out the conditionally excluded features; return the conditionally excluded
features back; let o = o + 1.]

Fig. 3. Simplified flowchart of ASFFS algorithm

Genetic Algorithm

Step 0 (Initialization)
    N   <- {Population size}
    P   <- {Initial population with N subsets of Y}
    p_c <- {Crossover probability}
    p_m <- {Mutation probability}
    T   <- {Maximum number of generations}
    k   <- 0

Step 1 (Evolution)
    Evaluation of fitness of P
    while (k < T and P does not converge) do
        Breeder Selection
        Crossover with p_c
        Mutation with p_m
        Evaluation of fitness of P
        Replication
        Dispersal
        k <- k + 1

Fig. 4. Simplified flowchart of GA



3.3 Genetic Algorithm 



Many studies have been carried out on GAs for feature selection. In a GA, a
feature subset is represented by a binary string of length D, called a chromosome,
with a zero or one in position i denoting the absence or presence of feature i. The
fitness of each chromosome is evaluated through an optimization function to decide
whether it survives to the next generation. A population of chromosomes is
maintained and evolved by two operators, crossover and mutation. This algorithm
can be regarded as a parallel and randomized algorithm. The simplified flowchart
of GA is shown in Fig. 4. GA has four main parameters to be set: the population
size N, the maximum number of generations T, the probability of crossover p_c,
and the probability of mutation p_m. Based on the results of a study by Kudo and
Sklansky, we use N = 2D, T = 50 and (p_c, p_m) = (0.8, 0.1). In addition, we use
T = 500 and N = 20D for giving much training to the GA. Two types of initial
population of chromosomes are given: 1) Type 1, denoted by {1, D - 1}, in which
there are 2D feature subsets consisting of all 1-feature subsets and all
(D - 1)-feature subsets, and we add (N - 2D) subsets from 2-feature subsets and
(D - 2)-feature subsets; and 2) Type 2, denoted by [m, M], in which N chromosomes
are randomly chosen for which the number of features is in this range.
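
A generic GA of this kind can be sketched in Python as follows. This is not the GA of Fig. 4: breeder selection, replication and dispersal are replaced here by binary tournament selection with elitism, crossover is one-point, and p_m is interpreted as a per-bit flip probability. These choices, the Type 2 style initialisation through `size_range`, and all names are assumptions made for illustration.

```python
import random


def run_ga(n_features, fitness, pop_size, generations,
           pc=0.8, pm=0.1, size_range=None, seed=0):
    """Return the fittest chromosome (tuple of 0/1 bits) found by a simple GA."""
    rng = random.Random(seed)
    lo, hi = size_range or (1, n_features)

    def random_chromosome():
        # Initial subsets whose size lies in the range [lo, hi].
        chosen = set(rng.sample(range(n_features), rng.randint(lo, hi)))
        return tuple(1 if i in chosen else 0 for i in range(n_features))

    population = [random_chromosome() for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        def pick():
            # Binary tournament: the fitter of two random individuals.
            a, b = rng.choice(population), rng.choice(population)
            return a if fitness(a) >= fitness(b) else b

        children = [best]  # elitism keeps the best chromosome found so far
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if n_features > 1 and rng.random() < pc:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                p1 = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability pm.
            children.append(tuple(bit ^ 1 if rng.random() < pm else bit
                                  for bit in p1))
        population = children
        best = max(population, key=fitness)
    return best
```

Because the best chromosome is always carried over, the fitness of the returned solution cannot decrease when the number of generations is increased for a fixed seed, which mirrors the observation that more training does not hurt.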

4 Experiments

As the criterion function, we used the leave-one-out correct recognition rate with
the 1-NN method. We used r_max = 3 and b = 3 for ASFFS and ASBFS. GAs
were carried out in a mode to find the maximum criterion value (corresponding
to the objective type Oc in the literature (3)) and repeated three times for each
set of parameters.
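
The criterion function can be sketched as below. The sketch assumes squared Euclidean distance on the selected features, which is not specified in the text, and breaks distance ties in favour of the sample with the lower index; the name `loo_1nn_rate` is hypothetical.

```python
def loo_1nn_rate(samples, labels, subset):
    """Leave-one-out correct recognition rate of the 1-NN rule.

    samples  list of feature vectors (lists of numbers), at least two of them
    labels   class label of each sample
    subset   indices of the features to use
    """
    n = len(samples)
    correct = 0
    for i in range(n):
        nearest, nearest_dist = None, float("inf")
        for j in range(n):
            if j == i:
                continue  # the held-out sample is not its own neighbour
            dist = sum((samples[i][f] - samples[j][f]) ** 2 for f in subset)
            if dist < nearest_dist:
                nearest, nearest_dist = j, dist
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / n
```

Each call costs on the order of n squared distance computations, which is why the number of criterion evaluations is the natural measure of speed for the search algorithms compared here.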

4.1 Experiment Using Mammogram Data 

We tested a mammogram database. The database is a collection of 86
mammograms from 74 cases. The 65 chosen features are 18 features characterizing
calcification (number, shape, size, etc.) and 47 texture features (histogram statistics,
Gabor wavelet response, edge intensity, etc.). There are two classes, benign
and malignant (57 and 29 samples, respectively). The six sets of parameters used
in GA are: (N, T, initial population type, p_c, p_m) =
(130, 50, {1, 64}, 0.8, 0.1), (1300, 50, {1, 64}, 0.8, 0.1), (130, 500, {1, 64}, 0.8, 0.1),
(130, 50, [10, 14], 0.8, 0.1), (1300, 50, [10, 14], 0.8, 0.1), (130, 500, [10, 14], 0.8, 0.1).
The results are shown in Figs. 5-7.

4.2 Experiment Using Sonar Data 

Next, we tested the algorithms using sonar data taken from a database. The
task is to discriminate between sonar signals bounced off a metal cylinder and
those bounced off a roughly cylindrical rock using 60 features, each of which
describes the energy within a particular frequency band, integrated over a certain
period of time. The database consists of 111 patterns obtained by bouncing sonar
signals off a metal cylinder at various angles and under various conditions and 97
patterns obtained from rocks under similar conditions. As the criterion function,
we used the leave-one-out correct recognition rate with the 1-NN method. We
used r_max = 3 and b = 3 for ASFFS and ASBFS. The six sets of parameters used
in GA are: (N, T, initial population type, p_c, p_m) =
(120, 50, {1, 59}, 0.8, 0.1), (1200, 50, {1, 59}, 0.8, 0.1), (120, 500, {1, 59}, 0.8, 0.1),
(120, 50, [18, 22], 0.8, 0.1), (1200, 50, [18, 22], 0.8, 0.1), (120, 500, [18, 22], 0.8, 0.1).
The results are shown in Figs. 8-10.

5 Discussion and Conclusion 

5.1 SFFS (SBFS) and ASFFS (ASBFS)

In both experiments, we observed the following: 

1. In performance, ASFFS (ASBFS) was superior to SFFS (SBFS) overall.
However, in some local problems of finding the best feature subset of a certain
size, SFFS (SBFS) was sometimes better than ASFFS (ASBFS). This means
that the local optimization strategy does not always give the global maximum.



Comparison of Classifier-Specific Feature Selection Algorithms

[Plots for Figs. 5-7: 1-NN leave-one-out recognition rate versus number of features.]

Fig. 5. SFFS vs. ASFFS for mammogram data.

Fig. 6. SBFS vs. ASBFS for mammogram data.

Fig. 7. ASF(B)FS vs. GA for mammogram data.

[Plots for Figs. 8-10: the horizontal axis is the number of features.]

Fig. 8. SFFS vs. ASFFS for sonar data.

Fig. 9. SBFS vs. ASBFS for sonar data.

Fig. 10. ASF(B)FS vs. GA for sonar data.
2. ASFFS took about one thousand times longer than SFFS: their average numbers
of criterion evaluations were 164339 and 175, respectively. This ratio is almost
the same for ASBFS and SBFS. The difference arises because ASFFS (ASBFS)
belongs to the class of generalized algorithms, which are inherently more
time-consuming.



5.2 GA with More Training 

Regarding the effectiveness of more training, with a larger number of generations
or a larger population size, we see from Figs. 7 and 10 that 

1. An increase in the generation number T (open and closed squares) or an 
increase in the population size N (open and closed diamonds) leads to better 
solutions compared with the original settings (open and closed circles). The 
effectiveness of both enhancements is almost the same. 

2. The results of GA with less training are comparable to those of SFFS (SBFS) 
and are a little inferior to those of ASFFS (ASBFS). 

5.3 ASFFS (ASBFS) and GA 

From Figs. 7 and 10 we see that 

1. In terms of ability to find the maximum value, GA with more training (a ten
times larger generation number or population size) and ASFFS (ASBFS)
are almost the same. GA requires a few trials to find better solutions but
sometimes finds better solutions than those found by ASFFS (ASBFS).

2. When ASFFS (ASBFS) is carried out in such a way that it finds feature 
subsets of every size, it is very time-consuming. This is a crucial difference 
from SFFS (SBFS). This is because ASFFS (ASBFS) is tuned to have more 
flexibility near a desired number of features d and this flexibility is different 
depending on the value of d, thus we cannot carry out ASFFS (ASBFS) 
sequentially in such a way that after finding a solution at size d it finds 
another solution at size d+ 1. 

3. In terms of speed, GA is faster than ASFFS (ASBFS). GA required at 
most 67000 evaluations, while ASFFS (ASBFS) needed about three times 
that number. However, since the time taken by ASFFS (ASBFS) strongly 
depends on the size D of the problem and the desired number of features 
d, for small- and medium-scale problems ASFFS (ASBFS) would be more 
effective than GA, as was shown in [5]. 

Based on the results, our recommendations are as follows. If the main pri- 
ority is time, the user should use SFFS or SBFS for small- and medium-scale 
problems and GA with less training for large-scale problems. If he or she really 
wants to achieve the best possible results (of course at the expense of the com- 
putational time), ASFFS or ASBFS should be used for small- and medium-scale 
problems, and GA with more training (for example, ten times generation number 
or population size) should be used for large-scale problems. 
Abstract. We propose a new method for performance-constraining the 
feature selection process as it relates to combined classifiers, and as- 
sert that the resulting technique provides an alternative to the more 
familiar optimisation methodology of weight adjustment. The procedure 
then broadly involves the prior selection of features via performance- 
constrained sequential forward selection applied to the classifiers indivi- 
dually, with a subsequent forward selection process applied to the clas- 
sifiers acting in combination, the selection criterion in the latter case 
deriving from the combined classification performance. We also provide 
a number of parallel investigations to indicate the performance enhance- 
ment expected of the technique, including an exhaustive weight optimisa- 
tion procedure of the customary type, as well as an alternative backward 
selection technique applied to the individually optimised feature sets. 



1 Introduction 

The non-overlapping of the misclassification errors of very distinct methods of 
classification has led to the realisation that, in general, no one method of
classification can circumscribe all aspects of a typical real-world classification problem, 
prompting, in consequence, the investigation of a variety of combinatorial me- 
thods in a bid to improve classification performance [eg 1-6]. Historically, such 
methods have in common that they operate at the level of the compound clas- 
sifiers’ output, typically combining the disparate PDFs in some fashion (eg the 
majority vote and weighted mean techniques familiar to the pattern-recognition 
and sensor-fusion communities), and, as such, not having any direct influence 
on the compositional character of the feature set presented to each of the clas- 
sifiers. We seek to address this deficit by attempting to obtain a near optimal 
feature set for the combined classifiers, in distinction to the optimal set for the 
individual classifiers, treating the latter as a starting point for this objective. 
Hence, by implication, this paper may also be considered an investigation into 
classifier combination as a constraint on feature selection, this being an issue in 
its own right; our primary aim, however, will be to improve combined classifier 
performance. 
2 Format of the Investigation 



We opt for the most straightforward method of classifier combination; that of 
obtaining the mean of the estimated posterior probability distributions of each 
of the classifiers composing the combination in order to elaborate our technique 
[cf eg 7]. It is not expected that this choice will substantially impact on the broad 
pattern of results that we find in section 4; in particular, the predominance of 
the sequential forward search method over the backward search; further results 
or mathematical argument would be required to provide absolute proof of this, 
however. 

The criterion function for feature selection appropriate to the outlined com- 
binatorial method is then simply the inverse of the misclassification rate arising 
from this mean of the estimated posterior probabilities in relation to the given 
feature set. The technique of feature set modification common to all of the various 
strands of the investigation is then the sequential selection of each of the features 
in turn from a bank of permissible features appropriate both to the particular 
classifier under consideration, as well as the method of feature set selection (ie 
unchosen features in the case of forward selection and unremoved features in 
the case of backward selection), respectively adding or subtracting the chosen 
feature from the existing set presented to that classifier. The previously speci- 
fied criterion function is then calculated for the combination, with the remaining 
classifiers maintaining their existing feature set (or when individual classifiers are 
to be considered in isolation, as below, the criterion function will instead derive 
from the selected classifier alone). The feature/classifier combination with the 
most advantageous criterion function is then appended/removed, as appropriate 
to the method of selection, from the list of permissible permutations. Thus fea- 
ture repetition between (but not amongst) classifiers is an inherent possibility 
in all but the case of the classifiers considered on an individual basis. There is 
then maximal freedom in the allocation of features, given that all of the various 
processes constituting the investigation are, in the broadest sense, sequential sel- 
ection methods and thus subject to the “nesting effect” [cf 8] wherein features, 
once selected, lack any mechanism for removal from the set presented to the 
classifier (with an equivalent, though inverted, problem for backward selection). 
This situation is slightly mitigated in the particular case of sequential backward 
selection applied to the forwardly pre-optimised feature sets outlined below, alt- 
hough results in section 4 indicate that this is not the favoured amongst the 
various possibilities in performance terms. Thus, all of the following techniques 
(the most effective one of which, by default, constituting the proposed method 
of combined classifier optimisation, with the remainder to be considered only as 
relative performance indicators) are invariably sub-optimal; any such predominating
method may therefore be most usefully treated as a relatively computationally
inexpensive addition to the repertoire of techniques for combined-classifier
optimisation, lying somewhere between exhaustive classifier weighting optimisation 
and exhaustive feature permutation optimisation, in terms of both execution 
time and performance. 
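
The selection loop described in this section can be sketched roughly as follows. This is a hedged illustration rather than the procedure actually used: `posterior_fn` is a hypothetical callback standing in for the trained classifiers, classifiers with an empty feature set are assumed to be left out of the mean, and the misclassification rate is minimised directly (equivalent to maximising its inverse).

```python
import numpy as np

def combined_error(posteriors, y):
    """Misclassification rate of the mean of the classifiers' posterior estimates."""
    mean_post = np.mean(posteriors, axis=0)          # shape (n_samples, n_classes)
    return float((mean_post.argmax(axis=1) != y).mean())

def forward_select(posterior_fn, n_classifiers, n_features, y, init=None):
    """Greedy forward selection over (classifier, feature) pairs.

    posterior_fn(c, feats) is a hypothetical callback returning the
    (n_samples, n_classes) posterior estimates of classifier c when it is
    given the feature tuple feats.
    """
    y = np.asarray(y)
    sets = [list(s) for s in init] if init else [[] for _ in range(n_classifiers)]

    def error_of(candidate_sets):
        # Assumption: classifiers without features do not take part in the mean.
        active = [posterior_fn(c, tuple(s)) for c, s in enumerate(candidate_sets) if s]
        return combined_error(active, y) if active else 1.0

    best, history = error_of(sets), []
    while True:
        candidates = []
        for c in range(n_classifiers):
            for f in range(n_features):
                if f in sets[c]:
                    continue  # no repetition within a classifier
                trial = [s + [f] if i == c else s for i, s in enumerate(sets)]
                candidates.append((error_of(trial), c, f))
        if not candidates:
            break
        err, c, f = min(candidates)
        if err >= best:
            break  # the peak has been passed, so stop
        sets[c].append(f)
        best = err
        history.append((c, f, err))
    return sets, history
```

Passing individually optimised feature sets as `init` would correspond to starting from pre-optimised classifiers instead of empty ones; backward selection would follow the same pattern with removal in place of addition.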
The various methods of feature selection optimisation constituting the
investigation are therefore: 

1. Sequential forward selection employing a combined classifier criterion func- 
tion and permitting unrestricted repetition of features between classifiers. 

2. Sequential backward selection employing a combined classifier criterion fun- 
ction, commencing with complete feature sets for all of the constitutive clas- 
sifiers. 

3. Sequential forward selection applied to each of the classifiers individually, 
employing the inverse of the misclassification rate of the estimated posterior 
probability distribution as the criterion function in each case. 

4. Sequential forward selection (employing the combined classifier criterion fun- 
ction) applied to the individually optimised feature sets for all constituent 
classifiers of the combination as derived from investigation number 3. 

5. Sequential backward selection (employing the combined classifier criterion 
function) applied to the individually optimised feature sets for all classifiers 
in the combination, as derived from investigation number 3. 

6. As a relative measure of the classification performance improvement attri- 
butable to the above processes, we supply an additional exhaustive weight 
optimisation to be applied to the individually optimised classifier/feature 
combinations derived from investigation number 3, acting in combination 
via the usual mechanism (the mean, which thus becomes a weighted mean). In 
this scenario the feature sets are not subject to change after their initial 
derivation by independent sequential forward selection, the weight modifi- 
cation being the sole source of performance optimisation, the exhaustivity 
of which being guaranteed by a series of nested loops, within which every 
permutation of PDF weight values (to within a specified resolution parame- 
ter) is inherently tested. The efficiency of this method (despite the series of 
nested loops) derives from the fact that the estimated posterior probabilities 
for each of the classifiers need not be re-derived for every iteration of the 
loop, the weights simply acting in multiple combination with the estimated 
class PDFs. This is not the case for the previously listed techniques, all of 
which have therefore a substantially greater (if not necessarily prohibitive) 
execution time. 
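
The exhaustive weight search of item 6 might look like the sketch below. It assumes that the posterior estimates of each classifier have already been computed once and stored in an array, that each weight ranges independently over [0, 1], and that `resolution` plays the role of the resolution parameter; the function name and the default value are illustrative only.

```python
import itertools
import numpy as np

def exhaustive_weights(posteriors, y, resolution=0.05):
    """Test every weight vector on a regular grid and keep the best one.

    posteriors: array of shape (n_classifiers, n_samples, n_classes),
    computed once beforehand, so that each iteration only re-weights them.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    y = np.asarray(y)
    grid = np.arange(0.0, 1.0 + 1e-9, resolution)
    best_w, best_err = None, np.inf
    # itertools.product stands in for the series of nested loops.
    for w in itertools.product(grid, repeat=len(posteriors)):
        if not any(w):
            continue  # all-zero weights give no decision
        combined = np.tensordot(w, posteriors, axes=1)   # weighted sum, (n_samples, n_classes)
        err = float((combined.argmax(axis=1) != y).mean())
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

The number of grid points grows as (1/resolution + 1) raised to the number of classifiers, so a fine resolution with four classifiers already implies a large number of iterations, each of which is cheap because no classifier is retrained.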

3 Nature of Implementation 

3.1 The Data 

The data employed throughout the investigation consists of a twinned set of 
expertly-classified geological survey data, one real and the other simulated, the 
latter simulation occurring at a stage prior to the application of the various pat- 
tern recognition methods from which the features are derived, and thus providing 
a measure of the distinction between conceptual and by-sight classification. In 
regard to our investigation, however, the essential difference between the two 
data sets may be considered simply in terms of their class separability; the
simulated data set exhibits this quality to a far greater degree than the real data, 
for which the class membership ambiguity is considerable. 

The nature of the image processing on the two data sets (which paralleled 
each other exactly) consisted in a battery of 26 cell-based processes for texture 
characterisation, chosen without regard to the particular nature of the classifi- 
cation problem. Thus, from the outset, a particularly high feature redundancy 
was anticipated for the corresponding 26-dimensional pattern vector. 

3.2 The Classifiers 

Four classifiers constituted the combination, chosen to collectively represent the 
gamut of classification philosophies. They are: 

1. Nearest Neighbour Classifier: 

This is a standard “1-NN” nearest neighbour classifier with Euclidean me- 
tric, adopted in place of the more reliable k-NN set of classifiers for reasons 
of efficiency, as well as conformity with the objective of bringing about ap- 
proximate parity of misclassification rates amongst the various classifiers. 

2. Neural Net Classifier: 

A Bayesian neural net classifier consisting of 3 hidden layers. 

3. Normal PDF Classifier: 

A Bayesian classifier employing a normal probability density function esti- 
mator. 

4. Quadratic PDF Classifier: 

As above, but employing a quadratic polynomial fitting function for the 
density estimation. 

4 Results 

The results of the six investigations are tabulated below for the real and syn- 
thetic data sets, respectively, with the training and test set data in both cases 
comprised of 1000 (of a possible 10000) random samples of their respective ori- 
ginals. The processing stages are in each case listed up to the point immediately 
preceding the termination of the procedure, at the point at which the peak of 
performance has been determined as being such (which is to say, exactly one ite- 
ration after the peak itself is reached) . This approach has been adopted primarily 
as an efficiency measure, there being no reason to suppose that there might exist 
further modalities to the performance distribution beyond this single peak; test 
procedures without this imposed terminating condition have tended to confirm 
the validity of the supposition. 

5 Conclusions 

A consideration of the experimental findings set out in section 4 would appear 
to suggest that method number 4 constitutes the most effective of the tested 
Table 1. Results of investigation number 1. (Real Data)

Order of feature addition:            1st        2nd        3rd        4th        5th        6th
Feature added:                        22         24         16         9          3          1
Classifier to which feature added:    4          3          3          3          3          2
Probability of misclassification:     0.0622269  0.0442208  0.0386601  0.0264795  0.0169469  0.0112538



Table 2. Results of investigation number 1. (Synthetic Data)

Order of feature addition:            1st       2nd       3rd       4th       5th
Feature added:                        21        10        1         7         19
Classifier to which feature added:    4         3         3         3         3
Probability of misclassification:     0.269688  0.184305  0.165239  0.144101  0.134706



Table 3. Results of investigation number 2. (Real Data)

Order of feature removal:                1st    2nd    3rd    4th    5th    6th    7th    8th
Feature removed:                         21     13     2      7      12     17     20     21
Classifier from which feature removed:   1      2      2      4      4      4      4      3
Probability of misclassification:        0.071  0.062  0.052  0.049  0.047  0.046  0.043  0.042



Table 4. Results of investigation number 2. (Synthetic Data)

Order of feature removal:                1st   2nd          3rd   4th   5th   6th   7th   8th   9th   10th  11th
Feature removed:                         16    17           13    25    11    7     14    4     15    10    14
Classifier from which feature removed:   1     1            2     4     4     4     3     4     4     3     1
Probability of misclassification:        0.35  [illegible]  0.28  0.28  0.27  0.27  0.26  0.26  0.26  0.25  0.24



Table 5. Results of investigation number 3. (Classifier 1, Real Data)

Order of feature addition:            1st      2nd        3rd
Feature added:                        2        4          16
Probability of misclassification:     0.15848  0.0581226  0.0382629



Table 6. Results of investigation number 3. (Classifier 1, Synthetic Data)

Order of feature addition:            1st       2nd       3rd       4th
Feature added:                        22        24        1         25
Probability of misclassification:     0.464907  0.201161  0.158607  0.118265



Table 7. Results of investigation number 3. (Classifier 2, Real Data)

Order of feature addition:            1st
Feature added:                        1
Probability of misclassification:     0.102608



Table 8. Results of investigation number 3. (Classifier 2, Synthetic Data)

Order of feature addition:            1st      2nd       3rd       4th
Feature added:                        22       24        23        20
Probability of misclassification:     0.30257  0.187621  0.159713  0.157917

Table 9. Results of investigation number 3. (Classifier 3, Real Data)

Order of feature addition:            1st
Feature added:                        1
Probability of misclassification:     0.109361



Table 10. Results of investigation number 3. (Classifier 3, Synthetic Data)

Order of feature addition:            1st       2nd       3rd      4th       5th
Feature added:                        1         17        12       14        21
Probability of misclassification:     0.472368  0.347748  0.28005  0.229898  0.212351



Table 11. Results of investigation number 3. (Classifier 4, Real Data)

Order of feature addition:            1st        2nd
Feature added:                        22         15
Probability of misclassification:     0.0622269  0.0558718



Table 12. Results of investigation number 3. (Classifier 4, Synthetic Data)

Order of feature addition:            1st       2nd       3rd
Feature added:                        21        4         22
Probability of misclassification:     0.269688  0.200055  0.153634



Table 13. Results of investigation number 4. (Real Data)

Order of feature addition:            Initial State  1st       2nd       3rd       4th          5th          6th
Feature added:                        -              15        18        23        2            16           20
Classifier to which feature added:    -              2         1         3         3            3            3
Probability of misclassification:     0.076525       0.016152  0.010459  0.010194  [illegible]  [illegible]  0.010062



Table 14. Results of investigation number 4. (Synthetic Data)

Order of feature addition:            Initial State  1st        2nd        3rd
Feature added:                        -              13         11         20
Classifier to which feature added:    -              4          3          3
Probability of misclassification:     0.160403       0.0965736  0.0893893  0.0862117



Table 15. Results of investigation number 5. (Real Data)

Order of feature removal:                Initial State  1st
Feature removed:                         -              1
Classifier from which feature removed:   -              3
Probability of misclassification:        0.0765259      0.0271415



Table 16. Results of investigation number 5. (Synthetic Data)

Order of feature removal:                Initial State  1st       2nd
Feature removed:                         -              21        25
Classifier from which feature removed:   -              3         1
Probability of misclassification:        0.160403       0.105554  0.103482

Table 17. Results of investigation number 6. (Real Data)

Classifier:                   1     2     3     4
Final weight combination:     0.21  0.00  0.05  0.63
Unweighted performance:       0.0765259
Final weighted performance:   0.0164173



Table 18. Results of investigation number 6. (Synthetic Data)

Classifier:                   1     2     3     4
Final weight combination:     0.57  0.00  0.22  0.95
Unweighted performance:       0.160403
Final weighted performance:   0.0871788



approaches to the problem of combined classifier optimisation, producing sub- 
stantially better classification performance than the more conventional weight 
optimisation, albeit at the expense of computation time. That method num- 
ber 5 (the sequential backward selection algorithm applied to the individually 
pre-optimised classifier feature sets) also produced some performance improve- 
ment over the “pre-optimisation” technique alone (albeit to a lesser degree than 
weight-optimisation) indicates that we have still not, in preferentially opting 
for method 4, achieved the optimally performing feature set appropriate to the 
classifier combination. A technique of alternating forward and backward feature 
selection passes applied to the pre-optimised feature sets should, to a greater 
or lesser extent, combine the best of the two differing mechanisms of optimisa- 
tion (to elaborate: the mechanisms of estimation error reduction attributable to 
redundant pattern space dimensionality in the case of backward selection, and 
complementarity of feature information in the case of forward selection). This, 
along with more complex mechanisms of floating feature selection, remains for 
further investigation. 

We note also that the two disparate methods of combined classifier optimisation
that the investigation divides into, namely weight optimisation and
feature-set optimisation, are in no way mutually exclusive. Notwithstanding the 
parallel format of our presentation of the two techniques for the purposes of 
comparison and contrast, it is perfectly possible, without significant addition 
to the execution time, to apply weight optimisation to the classifier/feature set 
combination obtained by the prior application of method 4. Thus a further per- 
formance enhancement would be expected, the optimisation methodology for the 
two optimisation techniques being of an entirely distinct nature. 

There is a further, more fundamental level at which the two differing techniques
might be integrated; rather than the two being applied in the consecutive
manner set out above, we might instead include weight optimisation immediately 
prior to the determination of the criterion function for inclusion of individual 
features in method 4 via an additional series of sub-iterations. Within such a 
procedure the finite operation time of the weight optimisation would become far 
more apparent, being repeated at every iteration of the feature selection algorithm,
but only to the extent of a fractional increase in the total execution time. 
This then, along with the aforementioned possibility of additional alternating 
backward and forward feature selection passes, would appear to represent, to 
the extent that the scope of the current investigation allows, the most promising 
direction for future techniques of combined-classifier optimisation to progress. 

We end with an observation previously alluded to, namely that there exists
within the sequential forward selection techniques employed above, the generalised
tendency for the combined criterion function to favour the addition of
features to those classifiers with a pre-existing feature-set, rather than to those
classifiers as yet without any features attributed, up to the point at which further 
features increase the misclassification rate for the classifier so favoured. A theory 
as to the origin of this effect is given below: 

6 Discussion 

In attempting to establish why, particularly, it is that features complement each 
other to such a greater extent when contained within a single classifier than 
when distributed over several classifiers, we might envisage the problem meta- 
phorically in terms of one-dimensional projections of a multi-dimensional pattern 
space (a total of two dimensions chosen for simplicity throughout the following). 
We know from the theory of Radon transforms that it is possible to reconstruct 
a two-dimensional pattern space from one-dimensional line integrals taken at 
various angles and intervals across that space only if the angular sampling of 
these lines matches the linear sampling; there is not sufficient information con- 
tained within the line integrals for reconstruction of that pattern space for the 
case in which the linear resolution greatly exceeds the angular resolution. This, 
however, is exactly the situation that occurs when single features of a pattern 
space are considered in isolation within separate classifiers; the act of obtaining 
a single feature for inclusion in a specific classifier is, in effect, to integrate li- 
nearly across the superfluous dimensions of that feature space. In the scenario 
we have outlined, when considering only a total of two feature-space dimensions, 
the angular samples of the pattern space exist at only two points, namely the 
perpendicular axes of the pattern space. Now, the linear resolution is as great 
as the number of samples in the space, which for our investigation is of the 
order of 1000; clearly, then, this number is far in excess of the angular sample 
rate of 2. Therefore, even for classifiers that obtain extremely good classifica- 
tion performance on the two features considered independently, there can be no 
conceivable method of classifier combination that can recover all of the informa- 
tion that dictates the multi-dimensional morphology of the class structures of 
the pattern-space. The two feature dimensions when contained within a single 
classifier, however, have only the limitations associated with the sample size and 
the classifier itself in determining this morphology. One may therefore see that, 
once a feature has been allocated to a classifier on the basis of classification 
performance alone, further feature additions to the same classifier, if made on 
the same basis within a sequential forward selection scenario, will almost invariably
follow; all non-exhaustive selection algorithms that treat combined classifiers 
and employ nested feature sets in any manner will almost invariably, and to a 
similar degree, exhibit this effect. 
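
The argument can be illustrated numerically with a toy example that is not taken from the experiments above. The sketch below builds a hypothetical two-feature, two-class "XOR" arrangement, in which each feature viewed in isolation carries almost no class information, and compares leave-one-out 1-NN accuracy on each single feature with that on both features together. The data generator, noise level, and sample sizes are all assumptions chosen for illustration.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out 1-NN accuracy with the Euclidean metric."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    return float((y[d.argmin(axis=1)] == y).mean())

def xor_data(n_per_cluster=50, noise=0.1, seed=0):
    """Four Gaussian clusters: opposite corners of the unit square share a class."""
    rng = np.random.default_rng(seed)
    centres = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
    labels = np.array([0, 0, 1, 1])
    X = np.repeat(centres, n_per_cluster, axis=0)
    X = X + rng.normal(scale=noise, size=X.shape)
    y = np.repeat(labels, n_per_cluster)
    return X, y

X, y = xor_data()
acc_f0 = loo_1nn_accuracy(X[:, [0]], y)   # feature 0 alone: projection mixes the classes
acc_f1 = loo_1nn_accuracy(X[:, [1]], y)   # feature 1 alone: likewise
acc_joint = loo_1nn_accuracy(X, y)        # both features inside a single classifier
```

In this construction the two single-feature accuracies sit near chance while the joint accuracy is close to perfect, which is the extreme form of the situation described: the class structure exists only in the joint space, so it cannot be recovered by combining classifiers that each see one projection.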
Abstract. We have developed relative feature importance (RFI), a metric for the 
classifier-independent ranking of features. Previously, we have shown the metric 
to rank accurately features for a wide variety of artificial and natural problems, 
for both two-class and multi-class problems. In this paper, we present the design 
of the metric, including both theoretical considerations and statistical analysis of 
the possible components. 



Keywords: discriminatory power, feature selection, feature extraction, 
feature analysis, non-parametric, classifier-independent, relative feature 
importance, multi-class 



1 Classifier-Independent Feature Analysis 

In all feature analysis problems some initial set of candidate features must be identi- 
fied. The candidate features are the result of some external analysis or search process. 
Feature analysis techniques analyze the usefulness of the candidate features. They can- 
not guarantee that there does not exist an as yet undiscovered feature which may be 
more useful. They also cannot guarantee that classification error overall could not be 
reduced using features not in the candidate feature set. 

Since classifier-independent feature analysis is based on the structure of the data, 
features can be analyzed only on the basis of a learning sample. The learning sample is 
a set of correctly classified objects that are represented by feature values for the fea- 
tures in the candidate feature set. Since classifier-independent feature analysis is driven 
by the learning sample, a high degree of confidence in the learning sample is impor- 
tant. 

The learning sample is taken as baseline truth; therefore the classes represented in a 
learning sample are necessarily collectively exhaustive. The problem of missing 
classes (sometimes called new class discovery [1]) is a separate problem from the feature
analysis problem. In new class discovery the features are assumed to represent the 
objects accurately, and are used to explore the structure of the logical space. In feature 
analysis the classes are assumed to be known accurately, and are used to explore the 
structure of the feature space. Thus, the classes can be assumed to be collectively 
exhaustive without significant loss of generality. In contrast, a significant loss in gener- 
ality does result from assuming that the classes are mutually exclusive. In medical 
diagnosis, for example, assuming in the general case that a patient can have at most 
one pathology is unrealistic. 

The goal of classifier-independent feature analysis for classification is to measure 
the usefulness of the features in the candidate feature set. Nonetheless, classification 
performance on the learning sample cannot be used in and of itself as a basis for ana- 
lyzing the features for several reasons. First, it has been shown that, in the general case, 
features that optimize classification performance for one classifier may not perform at 
all well in another classifier [2]. More fundamentally, though, classifier-independent 
feature analysis tries to measure the potential for discrimination between classes of the 
features in the candidate feature set, which potential may not be realizable in practice. 

Once classification performance has been eliminated as a measure of usefulness, 
what remains is the separability between the classes. Separability is not subject to the 
theoretical constraints of classification performance. When expressed as Bayes error, 
the separation between class-conditional joint feature distributions places a lower 
bound on classification error that is classifier-independent. Unfortunately, Bayesian 
error is not calculable for many problems. Nonetheless, separation between class-con- 
ditional joint feature distributions gives rise to the potential for classification. Issues of 
calculation aside, classifier-independent feature analysis uses separability between 
classes as the basis for the usefulness of a feature. 

A theoretical constraint placed on feature analysis is that feature rankings are subset 
dependent. Even under the assumption of feature independence, feature rankings can 
change as a function of adding and removing features [3]. Nonetheless, ranking the 
features is a critical component of feature analysis: in medical diagnosis, when test 
results are ambiguous, the physician needs guidance as to their relative value for dis- 
crimination. Therefore, ranking is given within a subset, with the critical ranking being 
that within the optimal subset. The optimal subset of the candidate feature set is 
defined as the smallest subset with the maximum potential for separability between 
classes. 

2 Measuring Separation: Discriminant Analysis 

Discriminant analysis can be used to extract features that maximize the ratio of the 
separation between classes to the spread within classes, as measured by the between- 
class and within-class scatter matrices. Within-class scatter is a measure of the scatter 
of a class relative to its own mean. Between-class scatter is a measure of the distance 
from each class to the mean(s) of the other classes. Within-class and between-class 
scatter can be defined parametrically or non-parametrically. Parametric scatter matri- 
ces use the learning sample to estimate the distributions of the features through estima- 
tion of parameters for an assumed distributional structure. Non-parametric scatter 
matrices use the learning sample to perform local density estimation around individual 
samples, and then measure scatter using the local density estimates. 

The parametric versions of the within-class and between-class scatter matrices
estimate the means of the classes based on the entire learning sample. The parametric
versions assume that a distribution can be characterized by its mean and covariance. Let
$P_i$ be the a priori probability of class $\omega_i$, $\Sigma_i$ be the covariance matrix and $M_i$ be the
mean of class $\omega_i$, $N$ be the total number of samples, and $L$ be the number of classes
present in the learning sample. Parametric within-class scatter is defined as the
averaged covariance. The a priori probability is estimated from the learning sample as
$N_i/N$, where $N_i$ is the number of samples from $\omega_i$. $\Sigma_i$ is estimated by the sample
covariance matrix. Parametric between-class scatter is the scatter of the expected means
around the mixture mean. The components of the between-class scatter matrix are
estimated using the learning sample in the same manner as the within-class scatter
matrix. 

RFI uses non-parametric versions of the scatter matrices based on versions proposed 
by Fukunaga and Mantock [4]. They based their non-parametric scatter estimates on 
local density estimates using the k-nearest neighbors (kNN) technique. They defined 
the $\omega_i$-local mean for a given class $\omega_i$ and a given sample $x_j$ as 

$$\mathcal{M}_i(x_j) = \frac{1}{k} \sum_{q=1}^{k} x_{qNN}^{(i)} \qquad (1)$$

where $x_{qNN}^{(i)}$ is the $q$th nearest neighbor of $x_j$ in $\omega_i$. Because 
Fukunaga and Mantock experimented only with two-class problems, they could use the 
$\omega_i$-local mean for calculating both within- and between-class scatter. 

While use of the local mean introduces the parameter k, its behavior is well studied. 
With infinite sample size, the accuracy of the local density estimation improves as k 
increases. With finite sample size, k is subject to the problem of oversampling, otherwise 
known as the Hughes phenomenon [5]. A value of k which is too large for the sample size 
performs local density estimation on non-local samples. A value of k which is too small 
for the sample size reduces the accuracy of the local density estimation. In practice, k 
is generally set to a small fraction of the number of samples [6]. 

To generalize Fukunaga and Mantock's approach to more than two classes, the local 
out-of-class mixture mean for each sample $x_j^{(i)}$ (sample $x_j$ drawn from class 
$\omega_i$) is defined as 

$$\mathcal{M}_{r \neq i}(x_j^{(i)}) = \frac{1}{k} \sum_{q=1}^{k} x_{qNN}^{(r \neq i)} \qquad (2)$$

where $x_{qNN}^{(r \neq i)}$ is the $q$th nearest neighbor of $x_j^{(i)}$ outside of 
$\omega_i$. The local mixture mean differs from the parametric mixture mean in that it 
excludes data from a sample's own class. 
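
The two local means can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Euclidean distance is used, a sample is not counted as its own neighbor (the text does not say whether it should be), and the function names and toy data are hypothetical.

```python
import math


def k_nearest(x, candidates, k):
    """Return the k points in `candidates` closest to x (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(x, c))[:k]


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def local_mean(x, samples, labels, cls, k):
    """Mean of the k nearest neighbors of x inside class `cls`, as in Eq. (1).

    Assumption: x itself is excluded from the neighbor pool.
    """
    pool = [s for s, c in zip(samples, labels) if c == cls and s is not x]
    return mean_point(k_nearest(x, pool, k))


def out_of_class_mean(x, samples, labels, own_cls, k):
    """Mean of the k nearest neighbors of x outside class `own_cls`, as in Eq. (2)."""
    pool = [s for s, c in zip(samples, labels) if c != own_cls]
    return mean_point(k_nearest(x, pool, k))


# Toy learning sample with three classes (hypothetical data).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (9.0, 9.0)]
labels = [0, 0, 0, 1, 1, 2]
x = samples[0]

within = local_mean(x, samples, labels, cls=0, k=2)
between = out_of_class_mean(x, samples, labels, own_cls=0, k=2)
print(within, between)
```

Note that the out-of-class pool mixes all other classes, so the nearest foreign neighbors may come from a single class, as happens for the sample above.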

Non-parametric within-class scatter is defined as the averaged scatter, where scatter 
is measured around the local means. When $k = N_i$, the local mean reduces to the 
parametric mean, and therefore the non-parametric within-class scatter matrix reduces to 
the parametric version. Non-parametric between-class scatter is measured as the scatter 
around the out-of-class mixture means. The between-class non-parametric scatter matrix 
does not reduce to its parametric form as does the within-class, because the out-of-class 
mixture means necessarily exclude same-class samples, but the relationship is close when 
$k = N_j$. 

The use of the k-nearest-neighbor local density estimates introduces the need to 
choose a distance metric for determining the distance between a sample and its neigh- 
bors. Many distance measures have been proposed for use with kNN error estimation 
[7]. Two commonly used metrics are the Euclidean distance and the Mahalanobis dis- 
tance [6]. Fukunaga and Mantock used Euclidean distance in their original work. 
Mahalanobis distance should also be considered (especially using Fukunaga and Man- 
tock's original algorithm), since it incorporates information concerning the relative 
variance of the features. 

A further refinement introduced by Fukunaga and Mantock was the use of a weighting 
factor, $w_j$, to de-emphasize samples which lie far away from the classification 
boundary. RFI uses the natural multi-class extension of Fukunaga and Mantock's 
weighting factor as given in [8]. Using the weighting factor, the contribution of each 
$x_j$ to scatter is inversely proportional to its distance from the nearest classification 
boundary. 

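Thus, the final forms for non-parametric within-class and between-class scatter are 
(estimating components as necessary using the learning sample): 

$$S_w = \frac{1}{N} \sum_{i=1}^{L} \sum_{j=1}^{N_i} w_j \left(x_j^{(i)} - \mathcal{M}_i(x_j^{(i)})\right) \left(x_j^{(i)} - \mathcal{M}_i(x_j^{(i)})\right)^T \qquad (5)$$

and 

$$S_b = \frac{1}{N} \sum_{i=1}^{L} \sum_{j=1}^{N_i} w_j \left(x_j^{(i)} - \mathcal{M}_{r \neq i}(x_j^{(i)})\right) \left(x_j^{(i)} - \mathcal{M}_{r \neq i}(x_j^{(i)})\right)^T \qquad (6)$$

where $\mathcal{M}_i(\cdot)$ is the local mean of (1) and $\mathcal{M}_{r \neq i}(\cdot)$ 
is the local out-of-class mixture mean of (2). 



3 Theoretical Implications 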

The optimal extracted features are found by eigensystem decomposition of the ratio of 
the between-class to within-class scatter matrices. Specifically, the optimality criterion 
used is the trace: 

$$J = \mathrm{tr}\left(S_w^{-1} S_b\right) \qquad (7)$$

Thus, for both the parametric and the non-parametric forms, the eigenvectors form the 
linear transform which maximizes J, the ratio of the between-class to within-class 
scatter. The eigenvalues measure the amount of separation induced in the extracted 
space. The extracted features are optimal in the sense that they maximize separation 
between the class-conditional joint feature distributions in the rotated space. 
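
As a hedged illustration of the trace criterion, the sketch below computes J and the eigenvalues of the ratio matrix for the two-feature case only, where the inverse and the eigenvalues have closed forms. The helper names and the example matrices are hypothetical; for more features a linear-algebra library would normally be used. It assumes a nonsingular within-class scatter matrix and symmetric scatter matrices.

```python
import math


def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def multiply_2x2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][t] * q[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]


def separability(s_w, s_b):
    """Return J = tr(S_w^-1 S_b) and the two eigenvalues of S_w^-1 S_b."""
    ratio = multiply_2x2(inverse_2x2(s_w), s_b)
    trace = ratio[0][0] + ratio[1][1]
    det = ratio[0][0] * ratio[1][1] - ratio[0][1] * ratio[1][0]
    # For symmetric positive-definite S_w and symmetric S_b the eigenvalues
    # are real; the clamp only guards against rounding error.
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return trace, ((trace + root) / 2.0, (trace - root) / 2.0)


s_w = [[2.0, 0.0], [0.0, 1.0]]   # hypothetical within-class scatter
s_b = [[4.0, 0.0], [0.0, 1.0]]   # hypothetical between-class scatter
j_value, eigenvalues = separability(s_w, s_b)
print(j_value, eigenvalues)
```

The eigenvalues sum to J, which is why each one can be read as the share of separation carried by its extracted feature.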

Using the non-parametric scatter matrices, feature extraction is based on local den- 
sity estimation. Thus the results are a compromise between information provided in the 
various clusters or regions belonging to a class [8]. 

While it is not possible to define the class of problems for which the non-parametric 
scatter matrices accurately capture the discriminatory power of the features, it is never- 
theless possible to characterize those problems which are pathological. A problem 
under consideration can then be compared to the pathological problems in an attempt 
to determine the suitability of RFI for the problem. 

One class of problems which are pathological for RFI, regardless of the use of parametric 
or non-parametric scatter, are problems which violate the assumption that proximity in 
feature space can be used to determine class membership. These are problems which 
k-nearest neighbor classifiers cannot solve. Figure 1(a) illustrates one such classical 
problem, the checkerboard. Any problem which is a pathological problem for k-nearest 
neighbor density estimation is a pathological problem for RFI. 

Figure 1: Two pathological problems for RFI. 

A second class of pathological problems derives from the combining of information 
from multiple clusters or regions in the non-parametric scatter matrices. By constructing 
a problem wherein the transformations necessary to optimize the ratio of between-class to 
within-class scatter conflict, the non-parametric scatter matrices' ability to combine 
local information can be exploited as a weakness. Figure 1(b) illustrates such a problem. 
Note, however, that the parametric scatter matrices would do no better with the problems 
in Figure 1. 

Because non-parametric discriminatory power measures the potential of the features 
for inducing separability between classes, it is desirable that measures of non-paramet- 
ric discriminatory power be invariant with regard to rotation, scaling, and shift of fea- 
tures. Rotational and shift invariance eliminate the impact of irrelevant details of the 
measurement method for the features. Scale invariance eliminates the need for normal- 
ization of the features while preserving the critical information of the ratio of between- 
class to within-class scatter. 

RFI is a function of the eigenvalues and eigenvectors of the parametric and non- 
parametric scatter matrices. While the non-parametric scatter matrices are not as well 
understood as the parametric scatter matrices, the non-parametric forms are still sym- 
metric, as can be seen by observation of equations 5 and 6. Therefore, functions of 
eigenvectors and eigenvalues retain the same properties for both parametric and non- 
parametric scatter matrices. 

Rotational invariance results from the extraction technique; since the optimal fea- 
tures are extracted from the original features, rotation in the original feature space has 
no impact. Scale invariance results from the use of the ratio of between-class to within- 
class scatter; since both within-class and between-class scatter are equally affected by 
scaling a feature, the ratio removes the effects of scaling. Shift invariance results from 
the use of scatter around the means; the technique is therefore self-centering. 

All three forms of invariance reduce to the issue of preserving class separability, 
which is invariant under any nonsingular transformation (including rotation, scaling, 
and shift) [9]. Those transformations affect separability in the individual features (i.e., 
in the marginal feature distributions), but not between the classes themselves. Thus, so 
long as none of the extracted features is discarded, RFI is invariant. 
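
The scale and shift invariance can be checked numerically. The sketch below is a simplified, single-feature illustration using parametric scatter (an assumption made to keep the example short); in one dimension the criterion reduces to the scalar ratio of between-class to within-class scatter. The function name and data are hypothetical.

```python
def scatter_ratio(values, labels):
    """J = S_b / S_w for a single feature, using parametric scatter."""
    n = len(values)
    overall_mean = sum(values) / n
    s_w = 0.0
    s_b = 0.0
    for cls in sorted(set(labels)):
        members = [v for v, c in zip(values, labels) if c == cls]
        prior = len(members) / n                      # P_i estimated as N_i / N
        mean = sum(members) / len(members)
        s_w += prior * sum((v - mean) ** 2 for v in members) / len(members)
        s_b += prior * (mean - overall_mean) ** 2
    return s_b / s_w


values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]

original = scatter_ratio(values, labels)
# Rescale by 10 and shift by -4: both scatters grow by a factor of 100,
# so the ratio is unchanged.
transformed = scatter_ratio([10.0 * v - 4.0 for v in values], labels)
print(original, transformed)
```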

4 Finding the Optimal Subset 

The optimal subset of features is the smallest subset with the maximum potential for 
separability between classes. RFI extracts a set of optimal features from a set of origi- 
nal features, without the use of classifier-specific assumptions. The optimal subset of 
the original features can be found by maximizing the separation induced between the 
class-conditional joint feature distributions across all possible subsets of the original 
features, as measured using the optimal extracted features. The optimal subset of fea- 
tures is thus the smallest subset of original features which produces the maximum sep- 
aration, measured in the rotated space. 

Given the presence of redundant features, more than one subset of the same size may 
produce the same amount of separation. When two or more smallest subsets produce 
the same amount of separation, and that separation is the maximum separation found, 
then more than one optimal subset exists. The presence of more than one optimal sub- 
set is not a problem; in both assisted and automatic classification, it offers more 
options in the design of the classification system. 

The criteria commonly used in parametric discriminant analysis to find the optimal 
subset of features are not appropriate for the non-parametric case. Criteria such as the 
trace of the ratio of the between-class to within-class scatter matrices are based on the 
same simplifying assumptions as the parametric scatter matrices. The trace, when 
calculated on parametric scatter matrices, is monotonic as a function of subset size, 
reflecting the theoretical assumption that Bayes error also decreases monotonically as 
a function of subset size. 

Under conditions of limited sample size, the monotonicity assumption does not hold 
even for well-behaved data sets with unimodal Gaussian distributions, if the true distri- 
butions are not known and must be estimated. As the number of features increases for a 
fixed sample size, so does the error in the estimation. A second concern is the cost of 
including each feature, in computer time, in complexity, and sometimes in human suf- 
fering, as can be the case in medical diagnosis. In practice, whether for automatic clas- 
sification or assisted classification, more features is not always better. 

A non-parametric approach is to select the optimal subset based on the k-nearest- 
neighbor error in the extracted space. Because kNN error is based solely on proximity 
in feature space, it does not introduce any new classifier-specific assumptions. More- 
over, because kNN error is asymptotically at most twice the Bayes error, calculating 
kNN error in the extracted space estimates the theoretical lower limit on the potential 
classification error [10]. Using kNN introduces a new parameter (k, the number of 
nearest neighbors used to calculate the kNN error). Fortunately, as discussed in 
Section 2, the behavior of k is well understood. 
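
A minimal sketch of a kNN error estimate follows. The text does not say how the error is estimated, so leave-one-out estimation, Euclidean distance, and majority voting are assumptions, as are the function name and the toy data.

```python
import math
from collections import Counter


def knn_loo_error(samples, labels, k):
    """Leave-one-out k-nearest-neighbor error rate with Euclidean distance."""
    errors = 0
    for i, x in enumerate(samples):
        # Distances from x to every other sample, paired with class labels.
        others = [(math.dist(x, s), c)
                  for j, (s, c) in enumerate(zip(samples, labels)) if j != i]
        others.sort(key=lambda pair: pair[0])
        votes = Counter(c for _, c in others[:k])
        predicted = votes.most_common(1)[0][0]
        errors += predicted != labels[i]
    return errors / len(samples)


# Two well-separated classes plus one class-0 sample lying next to class 1.
samples = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (9.0,)]
labels = [0, 0, 0, 1, 1, 1, 0]

error = knn_loo_error(samples, labels, k=3)
print(error)
```

Only the stray sample at 9.0 is misclassified, so the estimated error is one in seven. An odd k avoids voting ties in two-class problems.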

Finding the optimal subset requires exhaustive search, since any non-exhaustive 
technique can do arbitrarily poorly in the general case [11]. The assumption of mono- 
tonicity, necessary for branch-and-bound algorithms to guarantee performance, is 
extremely restrictive, and rarely justified in real problems [12]. Whenever possible, 
exhaustive search should be done. For the purposes of evaluating different configura- 
tions of RFI, or for comparing estimators for non-parametric discriminatory power, 
exhaustive search is required. When applying RFI directly to real problems which are 
too large to execute exhaustive search, sub-optimal techniques must be used. 
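
The exhaustive search can be sketched as below. The scoring function is a stand-in: in RFI the score would be the separation measured in the extracted space, and ties would be decided statistically rather than by the simple `tolerance` used here. All names are hypothetical.

```python
from itertools import combinations


def smallest_best_subsets(n_features, score, tolerance=0.0):
    """Score every non-empty subset and return the smallest ones that reach
    the maximum score (within `tolerance`)."""
    scores = {}
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            scores[subset] = score(subset)
    best = max(scores.values())
    winners = [s for s, v in scores.items() if best - v <= tolerance]
    smallest = min(len(s) for s in winners)
    return [s for s in winners if len(s) == smallest]


def toy_score(subset):
    """Hypothetical separation: feature 0 is informative, features 1 and 2
    are redundant copies of a second informative feature, feature 3 is noise."""
    return float(0 in subset) + float(1 in subset or 2 in subset)


optimal = smallest_best_subsets(4, toy_score)
print(optimal)
```

The toy problem yields two optimal subsets because of the redundant pair, and the loop visits all 2^n - 1 subsets, which is what limits exhaustive search to small candidate sets.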

Since the criterion, J, used by RFI to estimate the inherent Bayes error in each fea- 
ture subset is a random variable, it is necessary to determine statistically whether the 
difference between separation in the subsets is due to the variance in the learning sam- 
ple or the effects of the subsets. An analysis-of-variance (ANOVA) is performed on the 
results from multiple data sets. Each subset is considered a different treatment for the 
purpose of the ANOVA. 

Calculating J for all possible subsets for each data set reduces the noise in the exper- 
iment. Each data set is a block in a block ANOVA. Calculation of J for each subset on 
a particular data set constitutes the experimental units within that block. The use of 
blocking reduces the noise in the data by reducing the number of data sets for the same 
number of experimental units. Since RFI does not carry over any information from one 
treatment to the next, the concept of order in applying the treatments is meaningless, 
and can be considered to be random. Thus, the model used by RFI is randomized block 
ANOVA. 
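
For reference, the F statistic of a randomized block design without replication can be computed as in the sketch below. The layout (rows as data sets, columns as subsets), the function name, and the numbers are illustrative assumptions; the resulting F value would be compared against a critical value for the returned degrees of freedom.

```python
def randomized_block_f(table):
    """table[b][t] holds J for block (data set) b and treatment (subset) t.

    Returns the F statistic for the treatment effect together with the
    treatment and error degrees of freedom.
    """
    n_blocks, n_treat = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_blocks * n_treat)
    treat_means = [sum(row[t] for row in table) / n_blocks for t in range(n_treat)]
    block_means = [sum(row) / n_treat for row in table]

    ss_treat = n_blocks * sum((m - grand) ** 2 for m in treat_means)
    ss_block = n_treat * sum((m - grand) ** 2 for m in block_means)
    ss_total = sum((y - grand) ** 2 for row in table for y in row)
    ss_error = ss_total - ss_treat - ss_block      # what blocking leaves over

    df_treat = n_treat - 1
    df_error = (n_treat - 1) * (n_blocks - 1)
    f_stat = (ss_treat / df_treat) / (ss_error / df_error)
    return f_stat, df_treat, df_error


# Three data sets (blocks) and three subsets (treatments), made-up J values.
table = [[1.0, 2.0, 4.0],
         [2.0, 3.0, 6.0],
         [3.0, 4.0, 5.0]]
f_stat, df_treat, df_error = randomized_block_f(table)
print(f_stat, df_treat, df_error)
```

Subtracting the block sum of squares from the error term is how blocking removes data-set-to-data-set variation from the comparison of subsets.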

A sensitivity analysis was performed to measure the impact of the algorithmic 
variations of Sections 2 and 3 on the ability of RFI to find the optimal subset. The 
problem chosen (see Table 1) has multiple clusters, mixed distributions, a noise feature, 
and three different ranks of non-noise features. Despite its complexity, the sensitivity 
analysis problem can still be solved using 600 samples per cluster, or 2400 samples in all. 

Table 1. Sensitivity Analysis Problem 

                            Class 1                           Class 2 
Feature  Bayes error  Cluster A       Cluster B        Cluster C        Cluster D     Rank 
         (B,C) 
1        37.5%        U[-3.0, -2.0]   U[-1.0, 0.0]     U[-0.75, 0.25]   U[4.0, 5.0]   1 
2        10%          U[-3.0, -2.0]   U[-0.5, 0.5]     U[0.3, 1.3]      U[4.0, 5.0]   3 
3        25%          U[-3.0, -2.0]   U[-1.5, -0.5]    U[-1.0, 0.0]     U[4.0, 5.0]   2 
4        noise        N(0,1)          N(0,1)           N(0,1)           N(0,1)        0 

A full factorial design was used, with two levels per factor. The coding chart for the 
experiments is given in Table 2. Four design points found the correct subset: 
non-parametric within-class scatter with Euclidean distance, using either parametric or 
non-parametric between-class scatter and either setting for k. 

Table 2. Coding chart for Sensitivity Analysis Part 1 

Factor                  -             + 
Within-class scatter    Parametric    Non-parametric 
Between-class scatter   Parametric    Non-parametric 
Distance measure        Mahalanobis   Euclidean 
k value                 1             5 

5 Ranking the Features 

RFI ranks features based on the contribution of the original features to the separation 
in the rotated space. The contribution of the original features to the separation in the 
extracted space can be estimated using the eigenvectors and eigenvalues of the optimal 
transformation. The magnitudes of the eigenvector components measure the amount that each 
original feature contributes to each extracted feature. The normalized eigenvalues estimate 
the amount of separability contributed by each extracted feature to separation in the 
extracted space. Thus the normalized eigenvalues can be used to estimate separability 
in the rotated space, and the eigenvectors can be used to estimate the amount which 
each original feature contributes to that separability. 

The contribution of the original features to the separation in the extracted space can 
be estimated without tuning parameters by using the Weighted Absolute Weight Size 
(WAWS) of [14]. WAWS uses the normalized eigenvalues to weight the contributions 
of the original features to the extracted features by the proportion of separation the 
extracted features contribute to separation in the extracted space. 
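
The exact WAWS formula is given in [14] and is not reproduced here. The sketch below shows one plausible reading of the description above, stated as an assumption: each original feature receives the sum of its absolute eigenvector weights, with each extracted feature weighted by its normalized eigenvalue. The function name and the example values are hypothetical.

```python
def waws(eigenvalues, eigenvectors):
    """Weighted absolute weight size per original feature (assumed form).

    eigenvectors[i][j] is the weight of original feature j in extracted
    feature i; eigenvalues[i] is the separation carried by extracted feature i.
    """
    total = sum(eigenvalues)
    shares = [value / total for value in eigenvalues]   # normalized eigenvalues
    n_features = len(eigenvectors[0])
    return [sum(share * abs(vector[j]) for share, vector in zip(shares, eigenvectors))
            for j in range(n_features)]


eigenvalues = [3.0, 1.0]
eigenvectors = [[0.6, -0.8],
                [0.8, 0.6]]
scores = waws(eigenvalues, eigenvectors)
print(scores)
```

Under this reading the second original feature scores higher because it dominates the extracted feature that carries three quarters of the separation.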

Features with statistically distinct WAWS values are given different ranks. To deter- 
mine whether WAWS values are distinct, a second randomized block ANOVA is per- 
formed, and intervals constructed around the differences between treatment means 
using the multiple comparisons formula. Each feature is thus a treatment, and each 
data set (optimal subset only), a block. 

Features whose intervals around the differences from all other features exclude zero are 
given distinct ranks. Groups of features in which some features have distinct WAWS values, 
but others do not, are given a single rank. Features not in the optimal subset have rank 
zero. Features (or groups of features) with distinct ranks are ranked based on their 
treatment means, with the largest distinct treatment mean being assigned the highest 
rank. Higher ranks indicate greater discriminatory power. 

The sensitivity analysis was performed a second time to measure the impact of the 
design alternatives on the ability of RFI to correctly rank the features, given the 
optimal subset. Three design points ranked the features correctly (see Table 3). 
Non-parametric between-class scatter is clearly shown to be necessary. The setting for k 
is, again, shown not to be critical, as would be expected. Using parametric within-class 
scatter with Mahalanobis distance also ranks the features correctly, given the optimal 
subset. Thus, the use of Mahalanobis distance compensates to some degree for the 
information lost by the parametric scatter matrix. 

Table 3. Design points which correctly rank the features, given the optimal subset. 

Within-class   Between-class   Distance   k value   Ranking 
scatter        scatter         measure              method 
-              +               -          +         + 
+              +               +          -         + 
+              +               +          +         + 

6 Complete Algorithm 

In practice, RFI first finds the optimal subset, and then ranks the features within that 
subset. A final sensitivity analysis was performed, using the complete algorithm. Two 
configurations of RFI correctly solved the problem. The only factor that was not critical 
was the number of nearest neighbors. Within-class scatter had to be calculated 
non-parametrically using Euclidean distance to find the correct subset. Between-class 
scatter had to be calculated non-parametrically to rank the features correctly. To 
correctly rank the features overall (assigning zero to features outside the optimal 
subset), both within-class and between-class scatter had to be calculated 
non-parametrically, and Euclidean distances had to be used. Note that the insensitivity to 
k may have been due to the use of uniformly distributed signal values. Earlier research 
with Gaussian signal features has demonstrated greater sensitivity to k [15]. 

7 Conclusions and Future Research 

A number of choices were considered and resolved in the design of RFI. RFI must use 
non-parametric scatter matrices for both within-class and between-class scatter, based 
on the results of the sensitivity analysis. RFI selects the optimal subset of the candidate 
features based on their potential for inducing class separability; thus, RFI uses kNN 
error in the rotated space to find the optimal subset. The choice of kNN error was made 
because it is asymptotically bounded above by twice the Bayes error as sample size 
increases. In addition, kNN error introduces no new assumptions, being based on the 
assumption that proximity in feature space can be used to determine class membership. 
RFI uses randomized block ANOVA to determine whether one subset (or set of subsets) has 
statistically better class separability than the other subsets. 

The design of RFI presented here has been shown to correctly rank features for a 
variety of two-class and multi-class artificial and natural data problems [8,14,16]. 
Planned enhancements of RFI include incorporation of cost information and categorical 
features in the kNN density estimation. In addition, the computational cost of the 
algorithm might be reduced through the application of such techniques as adaptive or 
edited kNN. 

Abstract. The goal of this work is to propose a general-purpose crossover operator for 
real-coded genetic algorithms that is able to avoid the major problems found in this 
kind of approach, such as premature convergence to local optima, the weakness of 
genetic algorithms in local fine-tuning, and the use of real-coded genetic algorithms 
instead of the traditional binary-coded ones. Mathematical morphology operations have 
been employed for this purpose, adapting their meaning from other application fields to 
the generation of better individuals along the evolution in the convergence process. 
This new crossover technique has been called mathematical morphology crossover (MMX), 
and it is described along with systematic experiments that test its high speed of 
convergence to the optimal value in the search space. 



1 Introduction 

Since its genesis, the genetic algorithm (GA) paradigm has been increasingly used in 
search and optimization problems. GAs are generally represented as a set (or a 
population) of one-dimensional strings, called individuals, each of which contains a given 
number of chromosomes and maps a possible solution to the problem. Given an 
evaluation function, GAs approach the optimal solution by applying various operators 
over the individuals of the population. Operators such as reproduction, crossover and 
mutation are the most frequently used [1]. 

Several improvements have been made to GAs during the last decade in order to 
avoid some of the major drawbacks found in this kind of approach. The problems that are 
the object of attention in this work are: premature convergence to local optima, the 
weakness of GAs in local fine-tuning, and the use of real-coded GAs instead of the 
traditional binary-coded GAs. 

The problem of premature convergence of the GA to local optima of the objective 
function is closely related to the loss of genetic diversity of the population, which 
causes a decrease in the quality of the solutions found. Several techniques have been 
proposed in order to avoid this problem. Half greedy crossover [2] is based on delaying 
the convergence process, making it slower. Other approaches dynamically change the 
probability of mutation: at regular intervals in time, the value of the standard 
deviation of the population is tested, and if it is lower than a predefined value, the 
probability of mutation is increased, producing new random individuals and so more 
diversity [3]. This approach has the drawback of the high computational cost derived from 
the calculation of the standard deviation. Generalized binary crossover (GBX) [4] has the 
important feature that it dynamically extends the schemata defined by the two parents in 
order to pick the offspring. The problem here is that this expansion may be uncontrolled, 
as it depends on the binary coding of the values of the parents. BLX-α [5] is an example 
of a crossover operator for real-coded GAs that has the ability to expand the interval 
defined by the two parents, but depending on the previously fixed parameter α, not on the 
diversity of the population. 

The weakness of GAs in performing fine-tuned local search is widely recognized. 
Although GAs exhibit fast convergence at first to a point of approximate solution in a 
search space, when a population reaches a state where it is dominated by the best 
individual, finding a better solution is very difficult, resulting in very inefficient 
search. There are several ways to solve this problem. One of them is to have a larger 
population; however, this requires extensive computation for each generation. Another 
solution would be to combine GAs with other approaches such as gradient descent methods: 
first GAs are used to locate a point near the solution, then gradient descent leads to 
the final solution. The problems here are knowing the best point in time to change from 
GA to gradient descent in order to be efficient, and the premise that the derivative of 
the fitness function must exist near the optimal solution. 

Finally, a growing number of researchers have come to champion real-coded (or 
floating point) chromosomes as opposed to binary-coded ones. Real-coded GAs offer three 
main advantages. First, real-coded GAs eliminate the worry that there is inadequate 
precision to represent good values in the search space. Second, the range of parameters 
does not have to be a power of two. Third, real-coded GAs have the ability to exploit the 
gradualness of functions of continuous variables [6]. Several new crossover operators 
have been designed to work with real numbers, such as Radcliffe's flat crossover (RFX) 
[7]. This operator chooses parameters for an offspring by uniformly picking parameter 
values between (inclusively) the two parents' parameter values. Of course, this approach 
has the premature convergence problem. Another important, but also quite computationally 
expensive, crossover technique for real-coded GAs is UNDX [8], which can optimize 
functions by generating the offspring using the normal distribution defined by the three 
parents. 

A general-purpose method of optimizing functions using real-coded GAs is described 
throughout this paper. The GA proposed to accomplish this task employs a new crossover 
technique based on mathematical morphology [9] [10], and so it is called mathematical 
morphology crossover (MMX). In particular, MMX employs the morphological gradient, widely 
used for segmentation purposes [11], adapting its meaning from gray-scale images to the 
generation of better individuals in order to reach the optimal solution very quickly. The 
objective when MMX was being designed was to dynamically expand or narrow the interval 
defined by the parents depending on the genetic diversity of the population, thereby 
allowing fine local-tuning capabilities while avoiding premature convergence to a local 
optimum. 

Several optimization problems are devised to test this new approach, ranging from 
the optimization of some functions to the task of training artificial neural networks 
(ANN). These have been chosen to allow us to test its accuracy, high speed of 
convergence and fine-tuning capabilities without being trapped in any of the local minima 
that these problems present. 



2 MMX: Mathematical Morphology Crossover 

MMX works with a population of m individuals with l chromosomes each, coded with 
real numbers, so each point in the search space is defined by: 

$$s = (a_0, a_1, \ldots, a_{l-1}), \quad \text{where } a_i \in \mathbb{R} \qquad (1)$$

The operator MMX works with each gene in the parents independently to obtain the 
corresponding two genes in the offspring. Let $s_1, \ldots, s_n$ be an odd number of 
strings chosen from the current population to be crossed. The $(n \times l)$ gene matrix 
is defined as: 

$$G = \begin{pmatrix} a_{10} & a_{11} & \cdots & a_{1,l-1} \\ a_{20} & a_{21} & \cdots & a_{2,l-1} \\ \vdots & \vdots & & \vdots \\ a_{n0} & a_{n1} & \cdots & a_{n,l-1} \end{pmatrix}, \quad \text{where } s_i = (a_{i0}, a_{i1}, \ldots, a_{i,l-1}),\ i = 1, \ldots, n \qquad (2)$$

The crossover operator works with each column $f_i = (a_{1i}, a_{2i}, \ldots, a_{ni})$ in 
matrix G, obtaining genes $o_i \in \mathbb{R}$ and $o_i' \in \mathbb{R}$. The result of 
applying the operator to matrix G is, therefore, the two new descendants 
$o = (o_0, o_1, \ldots, o_{l-1}) \in \mathbb{R}^l$ and 
$o' = (o_0', o_1', \ldots, o_{l-1}') \in \mathbb{R}^l$. The procedure employed by this 
crossover to generate the new offspring string o from the parents $s_1, \ldots, s_n$ in 
matrix G is shown in figure 1. The descendant o' is obtained from o, as will be seen 
from formula 7. 




Fig. 1. Generating the new descendant o from the parents 



2.1 The Morphological Gradient Operator 

In the first step of the algorithm, the morphological gradient operator, usually 
employed on digital images f, is now applied to each vector $f_i$ with 
$i = 0, 1, \ldots, l-1$. In this case, $f_i$ may be considered as a function 
$f_i: D_{f_i} \to \mathbb{R}$, with $D_{f_i} = \{1, 2, \ldots, n\}$ and $f_i(j) = a_{ji}$. 
The structuring element taken to build the crossover operator is also a function 
$b: D_b \to \mathbb{R}$ defined as: 

$$b(x) = 0, \ \forall x \in D_b, \quad D_b = \{-E(n/2), \ldots, 0, \ldots, E(n/2)\} \qquad (3)$$

where $E(x)$ is the integer part of x. 

The morphological gradient of $f_i$ with the structuring element b is denoted 
$g_b(f_i)$. From this function, $g_i$ is obtained as the value: 

$$g_i = g_b(f_i)(E(n/2)+1), \quad i \in \{0, 1, \ldots, l-1\} \qquad (4)$$

The morphological gradient applied to images returns high values when sudden 
transitions in gray level values are detected, and low values if the pixels covered by 
the window (structuring element) are similar. A new interpretation of the morphological 
gradient has been given when applied to GAs: $g_i$ gives a measure of the heterogeneity 
of gene i in the individuals chosen to be crossed. If the value $g_i$ is high, the 
population is scattered, while if it is low, the values of that gene are converging. 
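
The gradient step can be sketched as follows, using the standard definition of the morphological gradient as dilation minus erosion with a flat (zero-valued) structuring element. The function names and the example gene matrix are hypothetical, and 0-based indexing replaces the 1-based positions used in the text.

```python
def dilate(f, offsets, pos):
    """Gray-scale dilation of f at index pos with a flat structuring element."""
    return max(f[pos - d] for d in offsets if 0 <= pos - d < len(f))


def erode(f, offsets, pos):
    """Gray-scale erosion of f at index pos with a flat structuring element."""
    return min(f[pos + d] for d in offsets if 0 <= pos + d < len(f))


def gene_gradient(column):
    """Morphological gradient of one gene column, evaluated at its middle."""
    half = len(column) // 2                 # E(n/2)
    offsets = range(-half, half + 1)        # domain of the structuring element
    middle = half                           # 0-based index of position E(n/2)+1
    return dilate(column, offsets, middle) - erode(column, offsets, middle)


def gene_gradients(gene_matrix):
    """One gradient per column (gene) of the n x l gene matrix."""
    return [gene_gradient(column) for column in zip(*gene_matrix)]


# Three parents with two genes each: gene 0 is diverse, gene 1 has converged.
gene_matrix = [[0.25, 3.0],
               [1.0, 3.0],
               [0.5, 3.0]]
gradients = gene_gradients(gene_matrix)
print(gradients)
```

Because the structuring element spans the whole column when n is odd, the gradient at the middle position equals the difference between the largest and smallest value of that gene.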



2.2 The Crossover Intervals 

This step determines l crossover intervals, one for each of the l chromosomes of the 
individuals of the population: $\{C_0, C_1, \ldots, C_i, \ldots, C_{l-1}\}$. Each of the l 
pairs $o_i$ and $o_i'$ of the offspring will be taken from the crossover interval $C_i$. 
In order to obtain the edges of each crossover interval $C_i$, let us define $\varphi$ as 
the function $\varphi: D_\varphi \to \mathbb{R}$, and the maximum gene as: 

$$g_{i\max} = \max(f_i) - \varphi(g_i) \qquad (5)$$

While the minimum gene is defined as: 

$$g_{i\min} = \min(f_i) + \varphi(g_i) \qquad (6)$$

The maximum gene and the minimum gene finally determine the edges of the 
crossover interval $C_i$ as $C_i = [g_{i\min}, g_{i\max}]$. 



2.3 Obtaining the Offspring 

The final result of MMX is the generation of two new descendants: 
$o = (o_0, o_1, \ldots, o_{l-1})$ and $o' = (o_0', o_1', \ldots, o_{l-1}')$. The gene 
$o_i$ is obtained by randomly picking a value inside the crossover interval $C_i$, while 
$o_i'$ is obtained as: 

$$o_i' = (\min(f_i) + \max(f_i)) - o_i \qquad (7)$$

This way, the following formula is satisfied: 

$$o_i + o_i' = \min(f_i) + \max(f_i) = g_{i\min} + g_{i\max} \qquad (8)$$



3 How Does MMX Work? 

Function $\varphi$, used to obtain the maximum and the minimum genes in formulas 5 and 6, 
determines the crossover intervals from which the offspring are taken. This function 
provides an important feature that allows MMX to work successfully: the crossover 
interval may be dynamically extended or narrowed with respect to the reference interval 
$[\min(f_i), \max(f_i)]$ determined by the parents, depending on the genetic diversity of 
the population. When the individuals to be crossed are diverse (which implies a high 
value of the gradient), the crossover interval is made narrower than the reference 
interval, so that its interior can be explored in search of the optimum much faster. On 
the other hand, if the individuals to be crossed are very similar (gradient close to 
zero), which means that the population is converging, then it is advisable to expand the 
interval $[\min(f_i), \max(f_i)]$ to allow the exploration of new points in the domain, 
thus avoiding possible convergence to a local optimum. This possibility of expanding or 
narrowing the crossover interval depending on the value of the gradient $g_i$ must be 
given by the chosen function $\varphi$, which must satisfy the following premises: 

It should have low computational cost because, on each application of the operator, 
$\varphi$ will be calculated l times (the dimension of the individuals of the population). 

Its domain must fit within the range of chromosome values. 

It is necessary that $g_{i\min} \le g_{i\max}$ in order to build the crossover interval 
$C_i$. From (5) and (6): 

$$\min(f_i) + \varphi(g_i) \le \max(f_i) - \varphi(g_i) \qquad (9)$$

Isolating $\varphi(g_i)$: 

$$\varphi(g_i) \le \tfrac{1}{2} [\max(f_i) - \min(f_i)] \qquad (10)$$

From (4), and using the structuring element defined in (3), it may be obtained that: 

$$g_i = \max(f_i) - \min(f_i) \qquad (11)$$

So we have from (10) and (11): 

$$\varphi[\max(f_i) - \min(f_i)] \le \tfrac{1}{2} [\max(f_i) - \min(f_i)] \qquad (12)$$

This yields the premise that $\varphi$ must finally satisfy to ensure that 
$g_{i\min} \le g_{i\max}$: 

$$\varphi(x) \le \frac{x}{2}, \quad \forall x \in D_\varphi \qquad (13)$$

A range in which function $\varphi$ returns positive values must exist in order to make 
the crossover interval narrower. Another range in which the function returns negative 
values is also necessary in order to expand the crossover interval. 

If $g_i = 0$ then all the values of chromosome $i$ of the individuals to be crossed are
equal:

$g_i = 0 \Rightarrow g_b(f_i)(E(n/2)+1) = 0 \Rightarrow f_i(j) = k, \quad \forall j \in D_{f_i}, \; k \in \mathbb{R}$    (14)

Fig. 2. Test functions f1 to f4

Due to this property, $\varphi$ must satisfy the condition $\varphi(0) \neq 0$; otherwise, it would
happen that:

$g_{i\max} = \max(f_i) - \varphi(0) = k$

$g_{i\min} = \min(f_i) + \varphi(0) = k$

Then the crossover interval is $C_i = [k, k]$, so the value of the offspring gene $o_i$ will
always be $k$. Under these conditions, if $k$ is not the optimal value for gene $i$, the
algorithm will stall and never reach an optimal solution.
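
The mechanism described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the interval bounds are taken as $g_{i\min} = \min(f_i) + \varphi(g_i)$ and $g_{i\max} = \max(f_i) - \varphi(g_i)$, as they appear in the derivation above; genes are assumed to be normalized to [0, 1]; and `phi_example` is a hypothetical piecewise-linear function chosen only because it satisfies the stated premises. No clipping to the domain is applied, so an expanded interval may slightly exceed [0, 1].

```python
import random


def phi_example(g):
    """Hypothetical phi (an assumption, not the function used in the paper).

    Negative for g <= 0.2 (expands the interval), positive for g > 0.2
    (narrows it), with phi(0) != 0 and phi(g) <= g / 2 on [0, 1].
    """
    return -0.25 * g - 0.01 if g <= 0.2 else 0.5 * g - 0.1


def mmx_crossover(parents, phi=phi_example, rng=random):
    """Return two offspring from a mating pool of equal-length real vectors."""
    length = len(parents[0])
    child, mirror = [], []
    for i in range(length):
        genes = [p[i] for p in parents]        # f_i: gene i across the pool
        lo, hi = min(genes), max(genes)
        g = hi - lo                            # gradient g_i, eq. (11)
        shift = phi(g)
        g_min, g_max = lo + shift, hi - shift  # crossover interval C_i
        o = rng.uniform(g_min, g_max)          # o_i picked at random in C_i
        child.append(o)
        mirror.append((lo + hi) - o)           # o_i' from eq. (7)
    return child, mirror
```

With identical parents the gradient is zero and the interval becomes $[k + \varphi(0), k - \varphi(0)]$, which has non-zero width because $\varphi(0) < 0$ in this example, so the offspring can differ from their parents.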



4 Experimental Results 

Six different optimization problems have been used to test MMX: the optimization
of five different functions (f1, f2, ..., f5) and the task of training an artificial
neural network (ANN) to solve the two-spirals problem. Our purpose in choosing functions f1
to f4 (shown in Figure 2) was not to come up with challenging problems, but to choose
a set of simple functions that expose the major problems of other approaches
that MMX solves. Function f5 and the ANN training test are larger
problems that show how MMX achieves a high speed of convergence without
being trapped in any of the many local optima that these examples present.
All the crossover operators employed produce two children from two parents,
except MMX, which uses a mating pool of five individuals. In each case, a population
of 50 individuals has been used, and the search was halted when either the minimum was
reached (with an error less than or equal to 10E-4) or the population converged. In order to
place the emphasis on the effects of each crossover, no mutation was used in any
algorithm. Parents to be crossed have been taken from the population by the
roulette-wheel method, and the new offspring replace the worst two individuals of the population.

In the case of MMX, the function $\varphi$ described in Section 3 has been chosen empirically
and is shown in Figure 3. This function, which satisfies all the premises previously
stated, is defined on the domain [0, 1], so the population had to be normalized to the
same range. This function performs only one floating-point multiplication, so its
application in the crossover operator described is very efficient, as it allows the generation
of a new individual with only $l$ multiplications. $\varphi(g_i)$ is positive when $g_i$ is strictly
greater than 0.2, making the crossover interval narrower in this case. On the other
hand, $\varphi(g_i)$ takes negative values if $g_i \le 0.2$, that is, the crossover interval is expanded
when the population is converging.
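
The premises that $\varphi$ must satisfy can be checked numerically for any candidate function. The sketch below samples the domain [0, 1] on a regular grid; `candidate_phi` is a hypothetical function (an assumption for illustration, not the one plotted in Figure 3) that reproduces only the sign behaviour described above, with one multiplication per call.

```python
def candidate_phi(g):
    # Hypothetical example (an assumption): negative for g <= 0.2,
    # positive for g > 0.2, one floating-point multiplication per call.
    return -0.25 * g - 0.01 if g <= 0.2 else 0.5 * g - 0.1


def check_premises(phi, samples=1001):
    """Check the premises of Section 3 on an evenly spaced grid over [0, 1]."""
    xs = [k / (samples - 1) for k in range(samples)]
    values = [phi(x) for x in xs]
    return {
        # phi(x) <= x / 2 keeps the crossover interval well defined.
        "bounded_by_half": all(v <= x / 2 for x, v in zip(xs, values)),
        # A positive range narrows the interval, a negative range expands it.
        "has_positive_range": any(v > 0 for v in values),
        "has_negative_range": any(v < 0 for v in values),
        # phi(0) != 0 prevents the degenerate interval [k, k].
        "nonzero_at_zero": phi(0.0) != 0.0,
    }


report = check_premises(candidate_phi)
```

A grid check of this kind is only a sanity test: it can reveal a violated premise, but it does not prove that the premise holds between the sampled points.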




4.1 Failure Tests 

Results given by MMX optimizing functions f1 to f4 have been compared to two
real-coded crossover operators, RFX and BLX-α with α = 0.5, and also to the
results reported for the binary-coded crossover operator GBX. Function f1 is a simple
incline problem with the optimum at one extreme: f(x) = x. The second, f2, is a
parabola with the minimum at the center: f(x) = x^2. The third, f3, has the shape of a V:
f(x) = |x|. Finally, function f4 has the shape of function f3 between points x = -0.1 and
x = 0.1, taking the minimum value of 0 at x = 0, and being 1 for all other values of x. For
these four functions, x ranges from -2 to +2 -1.
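
For reference, the four test functions can be written directly from the descriptions above. This is a small sketch; the only assumption is in `f4`, where the V shape of f3 is used unscaled inside [-0.1, 0.1] and the value 1 everywhere else.

```python
def f1(x):
    return x          # incline: optimum at one extreme of the domain


def f2(x):
    return x * x      # parabola with its minimum at x = 0


def f3(x):
    return abs(x)     # V shape with its minimum at x = 0


def f4(x):
    # Assumption: inside [-0.1, 0.1] the V of f3 is used unchanged;
    # everywhere else the function is flat at 1.
    return abs(x) if -0.1 <= x <= 0.1 else 1.0
```

The flat region of `f4` makes the difficulty mentioned in the text visible: almost every point has the same fitness, so fitness alone gives no hint about which points lie closer to the optimum.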

Results are reported in Table 1, which shows, for each of the crossover operators
tested, the percentage of successful trials over 1500 trials run. RFX is trapped
in local minima in most cases because it does not extend the interval beyond the
extrema determined by the parents. RFX cannot even solve the f1 problem; however, RFX
is the fastest algorithm when it is able to reach the optimum, which confirms the
assumption that it is better to make the crossover interval narrower in the first stages of
the convergence process. BLX-0.5 is similar to RFX, but BLX extends the reference
interval, which gives better results than RFX except for function f4. GBX has some
problems with function f4 because most of the points of the domain have the same fitness
and the algorithm cannot find out which ones are closer to the optimum. Finally, as
expected, MMX is never trapped in any local optimum. One of the main reasons
why RFX, BLX and GBX converge before reaching the optimum is that, when the
population is converging, there is a high probability of getting a mating pool of two
identical parents; the children are then identical to them, and the population
quickly converges. There is no way to obtain a child different from its ancestors because
the mutation probability is set to 0. This effect does not occur with MMX, which gives
different offspring for identical parents, so it can explore other points in search of
the best solution.

Table 1. Results for functions f1 to f4

          f1 succ. %   f2 succ. %   f3 succ. %   f4 succ. %
MMX       100%         100%         100%         100%
BLX-0.5   65%          91%          83%          44%
RFX       0%           82%          68%          62%
GBX       100%         100%         60%          43%



4.2 Performance Tests 

Function f5 and the task of training an ANN to solve the two-spirals problem have
been chosen as performance tests. f5 is a two-variable sine envelope sine wave function
very frequently used to measure the performance of GAs: f5 is cylindrically symmetric
about the z axis, and the global optimum is located at coordinates (0,0), with infinitely
many local suboptima arranged in concentric circles when seen from above.
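
The text does not give the formula of f5. One widely used function matching this description is Schaffer's F6, sketched below as an assumption rather than as the exact definition used in the experiments.

```python
import math


def f5(x, y):
    """Schaffer F6 form of the sine envelope sine wave function (assumed)."""
    r2 = x * x + y * y  # squared distance to the origin
    return 0.5 + (math.sin(math.sqrt(r2)) ** 2 - 0.5) / (1.0 + 0.001 * r2) ** 2
```

In this form the value depends only on the distance to the origin, which gives the cylindrical symmetry, and the global minimum value 0 is reached at (0, 0).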

MMX has been compared to BLX-0.5 and RFX optimizing function f5. These three
crossover operators have similar computational cost, so performance has been
measured in terms of the number of crossover operations needed to reach convergence.
Results for this test case, obtained from 1500 trials run, are reported in Table 2, which
shows the average number of iterations needed to reach the optimum value (with the
error required) and, in the last two columns, the percentage of failed trials (those in
which the population converged to a local optimum). The three operators take roughly
the same number of iterations to reach the optimum value. The main differences are that
MMX never fails and that it is much faster than the other algorithms in the first stages of
the convergence process, because MMX is the only algorithm that makes the crossover
interval narrower when there is high population diversity.



Table 2. Results for function f5

          Averaged iterations               % Failed trials
          error ≤ 10E-4   error ≤ 10E-3     error ≤ 10E-4   error ≤ 10E-3
MMX       16.234          4.972             0%              0%
BLX-0.5   17.480          6.157             83%             88%
RFX       16.824          5.356             94%             98%



Training an ANN with the proposed crossover operator has also been tested
using the two-spirals problem, in which, for each of 100 training points belonging to one
of two intertwined spirals, the network must tell to which spiral the point belongs.
This is a hard task for gradient descent and GA methods due to its complicated
error landscape. In this problem, each of the floating-point numbers of the individuals
in the population, representing the weights and biases of the network, is randomly
initialized within the interval [-25.0, +25.0]. The size of the population is 30, which is
quite small, allowing higher computational efficiency. To evaluate the fitness
of each individual, a feed-forward computation is performed by presenting the patterns
of one epoch, and the mean square error (MSE) is calculated. Comparisons between
MMX and other approaches based on GAs are not shown because all other approaches
were computationally much more expensive than MMX or fell into local optima most of
the time. However, in this case, MMX has been compared to backpropagation with
quick-drop (BPQ), one of the fastest variants of backpropagation [12].

Table 3 shows the average number of epochs needed by BPQ and MMX to reach
each level of error (it has been calculated that one BPQ epoch is approximately
equivalent to 220 MMX iterations). The last column shows the percentage of
failed trials. It can be seen from this table that, although the error landscape is very
complicated, MMX does not fall into any local optima while achieving a very high speed of
convergence, even when approaching the optimum.



Table 3. Results from the two-spirals problem

          Averaged epochs     % Failed trials
MSE       BPQ      MMX        BPQ      MMX
10E-3     326      3,74       47,5%    0%
10E-4     401      4,03       52,1%    0%



5 Conclusions 

A new general-purpose crossover operator called MMX has been proposed. This
operator works with real-coded individuals and is based on morphological techniques,
allowing faster convergence than other approaches based on GAs. MMX addresses
the major problems with GAs: convergence speed is very high for both small and larger
problems, as the different experiments demonstrated. It should also be noted that MMX
has been designed to avoid the loss of genetic diversity of the whole population in
order to prevent premature convergence to local optima. At the same time, and
dynamically with the convergence process, MMX performs local search, giving the
algorithm strong local fine-tuning capabilities and allowing a high speed of convergence
not only at the initial stage but also later, by focusing the search. Another important novel
feature of MMX is that, at the initial stages of the convergence process, when
there is high genetic diversity, MMX makes the crossover interval narrower, which
focuses its search and increases the speed of convergence.

The experiments shown in this paper also demonstrate that GAs with MMX clearly
outperform other crossover operators and gradient descent methods with a
relatively small population, avoiding the problem of falling into a local minimum even though
the error landscape is very complicated.

Abstract. In this paper we address the problem of facial expression
recognition. We have developed a new facial model based only on visual
information. This model describes a set of bidimensional regions corresponding
to those elements which most clearly define a facial expression. The problem of
facial gesture classification has been divided into three subtasks: face
segmentation, finding and describing relevant facial components and, finally,
classifying them into one of the predefined categories. Each of these tasks can
be solved independently using different techniques already applied to a wide
range of problems. This has led us to the definition of a modular, generic and
extensible process architecture. A prototype has been developed which makes
use of different simple solutions for each module, using a controlled
environment and a low-cost vision system. We report the experimental results
achieved by the prototype on a set of test images.



Keywords. Facial expression recognition, facial modeling, feature location,
facial segmentation, facial components.



1 Introduction 

Non-verbal communication plays a basic role in human interaction. Face or hand
gestures and voice tone add substantial meaning to communication. In particular, we
can associate each human emotion with one or more facial expressions. Indeed, this
implicit information is essential feedback that lets speakers know whether the
audience feels interested, surprised, amused or indifferent to their words. Hence,
depending on this feedback, the speaker will adapt his or her words to
obtain the desired effect on the audience.

Research in facial expression recognition attempts to provide automatic systems
with emotional information from their users. Tele-teaching is a good example to show
how these techniques can help computers to be more friendly and useful. Given a
computer equipped with a camera and facial expression recognition software, the system can
follow the student's reactions while he or she is studying a lesson and thus guide the
teaching process to keep a high attention level. Other examples of applications are
information browsing, VR, games, home safety and eldercare [1].

The recognition system presented in this paper proposes a generic and extensible 
solution to the following problem: given an image or a sequence of images obtained 
from a human face placed in the foreground in front of a camera, classify its gesture as 
the most likely one among a set of predefined ones. 



2 Related Research 

Early relevant results in the field of automatic analysis of facial expressions are due
to Ekman and Friesen [2], who in the seventies developed their Facial Action Coding
System (FACS), widely accepted nowadays. They proposed that facial movements
are produced by the activation of 43 Action Units (AUs), which can be measured
directly from the electrical activity of some muscular regions. Ekman and Friesen
defined a set of six basic emotion expressions: surprise, fear, disgust, anger, happiness
and sadness. The completeness of this set of emotions is still under debate.

Most of the ongoing projects are founded on the theoretical principles presented in
FACS. For example, the Integrated System for Facial Expression Recognition
(ISFER) developed at the Delft University of Technology [3] includes a set of
modules, each one implemented using different techniques. Researchers from M.I.T.
have designed an alternative coding method called FACS+ which avoids the use of
heuristics by characterising facial movements in a probabilistic way. An on-line
description of this system can be found in [4].

Neural networks are commonly applied to the classification phase of the problem.
An example is described in reference [5]. Some other approaches should also be
mentioned, such as the one proposed at Osaka University based on elastic networks [6], the
one based on flexible facial models with probabilistic adjustment [7] or those based on
optical flow analysis [8], among many others.



3 Facial Modeling 

The way a generic face is modelled is an essential question when trying to detect
facial expressions. Usually, extracting visual information requires fitting the model
to the input images. Therefore, the model description determines the way the
information is extracted and its structure. Nowadays, it is possible to find in the
literature a wide variety of facial models, such as tridimensional models of generic
faces or those based on flexible grids, eigenspaces, muscular activations or
characteristic points or regions.

The model presented in this paper is entirely based on visual information, i.e. bio- 
logical and environmental causes of image formation have not been considered ex- 
plicitly. The proposed model defines a set of bidimensional regions representing those 
frontal facial elements essential for expression recognition. Some possible input
images for the recognition problem are shown in Fig. 1.




Fig. 1. Possible input images for facial expression classification. Key information for human 
face expression analysis can be extracted from the shape and relative position of the eyes, eye- 
brows and mouth. 

Given the previous problem formulation, the model of a human face can be
expressed in terms of six main components: two eyebrows, two eyes, nose and mouth. No matter
how these characteristics are detected, their location and appearance are the only
information a human being requires to accurately classify a facial expression. Moreover,
the nose can be removed from the model since it does not contribute much
to facial gestures.

Some a priori information is used to complete the model since we know that rela- 
tive positions between the five considered elements (nose is not included) follow some 
fixed rules. For example, the eyebrows will always be placed over the eyes, and the 
mouth under and between them. Fig. 2 shows a hand-made graphical representation of 
the resulting model. It would be possible to find an automatic way to build this model 
by calculating a mean face from a set of examples, but our approximation is enough to 
reveal the underlying ideas presented here. 




Fig. 2. Graphical representation of the proposed
model. Striped zones correspond to uninteresting
regions while non-striped ones are used to
represent special shapes in certain relative positions.



Another important point to be considered is noise, defined as anything disturbing
the model. Apart from acquisition distortions, it is possible to define different
kinds of noise depending on the problem they cause. The following classification
shows some of them.

Component grouping. Low resolution, bad segmentation or poor illumination 
which may cause shadows, can lead to non-separable components. 

Component distortion. Shadows produced by facial components, moustache or 
beard have not been explicitly considered in the model. This could blur the shape 
and/or the location of the five relevant components. 

Component occlusion. Facial components might also be partially or totally hidden 
by other elements (fringe, dark glasses, etc). 



4 Performance Criteria 

Usually, within the recognition context, the utmost performance criterion is to 
maximise the percentage of recognition success and subsidiarily, to minimise the 
involved resources. Unfortunately, this is not of much help when trying to design a 
solution. 

We could consider the problem of facial expression recognition divided into a sub- 
task for extracting information from the images followed by a classification subtask. 
Thus, it is possible to define separate performance criteria for each subproblem. 

The feature extraction task gives a measure of how input images fit the model. 
Therefore, we define the following criteria for this step of the recognition process. 

“Good location” criterion. The five selected facial components must be detected
in positions as close as possible to the real ones. This classical performance
criterion was introduced by Canny in his work on edge detection. However, in our
case, accurate location refers not to points but to regions corresponding to
eyebrows, eyes and mouth.

“Good shape description” criterion. The chosen shape descriptor should be as 
simple as possible but complete enough to pick up all significant information from 
the facial components. For example, in order to describe an eyebrow it could be 
enough to know its height, width, inclination and global curvature. 

These criteria are quite difficult to quantify, but they can help us to design an ade- 
quate feature extractor for the proposed facial model. 

For the classification phase we use the well-known criterion of maximising the
number of correctly labelled examples.



5 Design of the Facial Expression Recognition Process 

The design of the recognition process is based on the model and the criteria previously 
defined. The aim is to build a modular, extensible and generic architecture which will 
be evaluated by implementing a prototype. 

We consider an architecture to be generic when only the responsibilities of its
modules, and not their explicit contents, are defined. Thus, modules can be grouped or
isolated as their responsibilities converge or differ, respectively.

Most current approaches implicitly divide the recognition process into three main
tasks: face segmentation, feature location and extraction, and classification. We
explicitly adopt this decomposition and propose the schema shown in Fig. 3.




Fig. 3. The facial expression recognition process. Notice the dotted boxes used to group the 
tasks into three main processes: segmentation, detection and classification. 



5.1 Face Segmentation 

Segmentation consists of removing non-interesting areas from images in order to
simplify the problem. Thus, given the model shown in Fig. 2, we must remove all
striped regions from the input images. Segmentation comprises the following phases:
finding the position, orientation and size of the face in the image, and removing
non-interesting regions from the input images.

Finding a face in an arbitrary image is not a trivial task. This problem has been
widely investigated and some solutions have been proposed. Most of them make use
of colour segmentation techniques, such as the one presented in [9].

After that, removing non-interesting pixels can be accomplished easily by
applying a mask to the input image. This mask corresponds to the striped areas of the
model and must be properly translated, scaled and rotated to fit the detected face.
Alternatively, we could normalise the input images and use a fixed mask.
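
The fixed-mask alternative can be sketched as follows. The representation of the image and the mask as lists of rows, and the function name `apply_mask`, are assumptions made for illustration; the image is taken to be already normalised so that it shares the mask's frame.

```python
def apply_mask(image, mask, background=0):
    """Keep pixels where `mask` is truthy and set the rest to `background`.

    `image` and `mask` are lists of rows with identical dimensions.
    """
    return [
        [pixel if keep else background for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

When the face position is not fixed, the same function applies once the mask (or the image) has been translated, scaled and rotated into a common frame.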

5.2 Detection of Facial Components

Once the face has been segmented from the image, it is necessary to find those regions
corresponding to the eyebrows, the eyes and the mouth and to represent them using
proper shape descriptors. On the one hand, it is known that those five facial
components are located in certain relative positions in the face. On the other hand, after
segmentation, the position, size and orientation of the face are known. Given these two
facts, the regions in which to search for these components are restricted and approximately
known.

The proposed method is based on the extraction of some points of interest using
low-level techniques and afterwards grouping them into the five facial components. A
possible way to find the points of interest could be to threshold the image.
However, as uniformity of intensity is not guaranteed across the face, it seems more
appropriate to use an edge detection technique, which can exploit the fact that the five
components, and only they, are darker than the rest of the face.

Once the points of interest have been extracted, grouping them into components is
an easier task, although not a trivial one. From the model, it is clear that five and only
five components must be found and that their most likely positions are known. Thus, the
grouping process is mainly guided by the a priori information extracted from the
model. Nevertheless, some problems, such as spurious points or non-separable components,
must be carefully taken into account.

After having grouped the points into the components, adequate shape descriptors 
must be computed to reduce the input information for the classifier. A simple solution 
could be to describe each region with one or more characteristic points. 

An alternative to the extraction of the points of interest could be the use of
deformable contours or snakes [10]. They would be initially located at the a priori positions
of the components and would incrementally fit the areas of maximum intensity
gradient. Thus, the tasks of component grouping and description would be accomplished
simultaneously.



5.3 Feature Classification 

Different kinds of output could be considered as the result of the recognition process, 
such as the opening degree of the eyes or of the mouth, the probability of a certain 
facial expression, etc. Here, we shall consider the output of the classifier, i.e. the class 
to which the input image belongs, according to the classification criteria. 

Classification is a generic and widely studied problem in A.I. Linear discriminants,
nearest neighbour methods and neural networks are the mechanisms most commonly
applied to solve it, although there are many others [11]. Each of them includes a vast
set of techniques. Nevertheless, they all need to work with a set of features
representative enough of the selected classes. Selecting proper features as the input to the
classifier is as important as the quality of the classifier itself.

For the set of features to be representative enough, a huge amount of data might be
needed. Some classifiers can cope with high-dimensional inputs, but if the number of
shape descriptor parameters is too high, a problem of undertraining can arise. To
avoid it, it is possible to introduce an intermediate stage for reducing the extracted
information to a small and meaningful computed set of features. Data mining
techniques could be applied to perform this extraction automatically.



6 Prototype Implementation 

The implemented prototype is a fulfilment of the architecture described in Section 5, showing
its practical viability. It is not our objective to build a complex system, but to select
different simple solutions for each module. The prototype has been tested in a
controlled environment.




Fig. 4. Mask and edge detection examples. a) Predefined segmentation mask, b) Segmented face
using (a), c) Edges found in (b) using the Canny operator, d) Edges found in (b) using the
Prewitt operator.

The prototype makes use of a simplified definition of the problem given by the
assumption of a fixed position of the face within the image: the position, size and
orientation of the face are considered to be fixed and known across all the images taken
from the same individual. This assumption makes the segmentation of the face trivial.
Removing non-interesting regions is always carried out using the same a priori
defined mask. Figure 4a shows the mask applied in the tests.

Facial component detection comprises two phases. Firstly, the points of interest are
searched for. Then, they are grouped into components and characterised using proper
shape descriptors. Edge points have been used as points of interest, as
justified in Section 5.2. Figures 4c and 4d compare the edges retrieved by two operators.
Although Canny’s edge detector is the most adequate in most applications, in our case
a simpler edge detector has been applied. We have chosen Prewitt’s operator, since it
is simpler to compute and we are interested in detecting regions rather than finding
edges.
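
As an illustration of why Prewitt's operator is cheap to compute, the following sketch extracts edge points from a grey-level image stored as a list of rows. The function name, the image representation and the use of a fixed threshold on the gradient magnitude are assumptions; the paper does not state how edge points were selected from the operator's response.

```python
def prewitt_edges(image, threshold):
    """Return (row, col) points whose Prewitt gradient magnitude exceeds threshold.

    `image` is a list of equal-length rows of grey levels; border pixels
    are skipped because the 3x3 neighbourhood is incomplete there.
    """
    rows, cols = len(image), len(image[0])
    points = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Horizontal derivative: right column minus left column.
            gx = sum(image[r + d][c + 1] - image[r + d][c - 1] for d in (-1, 0, 1))
            # Vertical derivative: bottom row minus top row.
            gy = sum(image[r + 1][c + d] - image[r - 1][c + d] for d in (-1, 0, 1))
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                points.append((r, c))
    return points
```

The resulting list of points is the kind of input that the grouping stage operates on.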

A flexible and elegant way to solve the problem of grouping the points of interest is
to apply a mixture model, using the EM algorithm to adjust its parameters. We assume
that edge points correspond to a random sample from a bidimensional probability
function made up of a mixture of five Gaussian probability functions, one for each
facial component. Using bidimensional Gaussians as basic components of the mixture
is equivalent to using elliptical shape descriptors. This choice achieves a good balance
between the information they supply and the descriptor complexity. Given the edge
points, we have applied the iterative EM algorithm to determine the Gaussian
parameters, i.e. mean and covariance matrix. Our implementation of the algorithm
includes the noise treatment proposed in [12].
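
A plain EM iteration for a mixture of bidimensional Gaussians can be sketched as follows. This is a minimal version under stated assumptions: the initial means are supplied by the caller (for instance, the a priori positions of the five components), every covariance starts as an isotropic matrix with variance `var0`, a small constant `reg` is added to the variances for numerical stability, and the noise treatment of [12] is not included. All names are hypothetical.

```python
import math


def gaussian_pdf(p, mean, cov):
    """Density of a 2-D Gaussian with covariance ((sxx, sxy), (sxy, syy))."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def em_mixture(points, means, iterations=50, var0=1.0, reg=1e-6):
    """Fit one 2-D Gaussian per initial mean to `points` with plain EM."""
    k, n = len(means), len(points)
    means = [tuple(m) for m in means]
    covs = [((var0, 0.0), (0.0, var0)) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            dens = [w * gaussian_pdf(p, m, c) for w, m, c in zip(weights, means, covs)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and covariance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj + reg
            syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj + reg
            sxy = sum(r[j] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nj
            weights[j], means[j] = nj / n, (mx, my)
            covs[j] = ((sxx, sxy), (sxy, syy))
    return weights, means, covs
```

Each fitted mean and covariance pair describes an ellipse, which is the sense in which the Gaussians act as elliptical shape descriptors.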

For the classifier, a simple solution has been chosen: a nearest neighbour method
using the Mahalanobis distance. Before classification itself is carried out, a feature vector
is calculated in order to reduce the amount of information the classifier has to deal
with. Six features have been empirically defined by analysing the way the five facial
components change across the different facial gestures. Specifically, the system
computes, from the Gaussian parameters previously obtained, the mouth width and
opening degree, eye opening degree, eye-eyebrow distance, eye-mouth distance
and eyebrow angle.
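
A nearest neighbour rule of this kind can be sketched as follows. To keep the example free of matrix inversion, the covariance matrix is restricted to its diagonal (per-feature variances estimated from the training vectors); this is a simplification assumed here, not necessarily the form used in the prototype, and all names are hypothetical.

```python
def feature_variances(samples):
    """Per-feature sample variance of the training vectors (diagonal covariance)."""
    n, dim = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dim)]
    return [sum((s[d] - means[d]) ** 2 for s in samples) / (n - 1) for d in range(dim)]


def mahalanobis_sq(a, b, variances):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances))


def classify(query, training, variances):
    """Return the label of the training example nearest to `query`.

    `training` is a list of (feature_vector, label) pairs.
    """
    return min(training, key=lambda item: mahalanobis_sq(query, item[0], variances))[1]
```

Dividing by the variances puts features with very different scales, such as a distance in pixels and an angle, on a comparable footing before the nearest example is chosen.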



7 Tests and Results 

For the acquisition of test images we have used a low-cost videoconference camera.
This kind of device is the most commonly used in tele-teaching environments and
its quality is good enough for our purposes. Recorded images have a resolution of
160x120 pixels with 256 grey levels.

A set of six basic facial expressions has been defined: normal, sleeping, smiling,
surprised, yawning and angry. These expressions are expected not to be forced or
exaggerated but natural. Despite this, they must be unambiguous for a human
observer.

For each class, we have recorded 10 examples under the same lighting conditions
and 5 more at a different time and under different conditions, all of them from the same
individual. Half of the first group have been used to train the system, while the other 10
(the remaining 5 and the 5 recorded under different conditions) have been
used as test images. Faces are supposed to be located in a fixed position within the
images, but in practice there is a 5% mean deviation in face position in the first set
of examples and 8% in the second set, with respect to face height.

Global classification results are shown in Table 1. In each cell two values are
shown: the left one corresponds to the rate achieved on the set with the same illumination
conditions, while the right one corresponds to the set with different conditions.



Table 1. Results of classification tests 

As shown in Table 1, the system successfully classifies 80% of the examples
from the first set and 73% of those from the second set. In the first set, classes
are well differentiated and the greatest confusion appears between the ‘normal’ and ‘angry’
expressions. On the second set the system performs worse, and errors are more irregularly
distributed. Some other simplified tests have been performed on sequences of images and
colour images. These tests exhibited similar results. Three examples of execution of
the prototype are shown in Fig. 5.

Although the prototype was not optimised for execution time, it is worth
emphasising the high speed achieved: a complete execution, including reading the
input image from disk and its graphical representation on the screen, takes about 0.35
seconds on a K6 processor running at 350 MHz.




Fig. 5. Three prototype execution examples. Left: input images; middle: face mask and edge
points; right: resulting shape descriptors. Top to bottom: laugh, anger, sleeping. All examples
were successfully classified by the prototype.



8 Conclusions 

In this paper we have addressed the problem of facial expression recognition. We 
propose a facial model entirely based on visual information and not on biological or 
tridimensional models of the face. As a result, we have designed a model based on 2D 
characteristic regions corresponding to the five most relevant facial components. 

The developed process relies heavily on this model. First, the division between
interesting and non-interesting regions leads to a segmentation phase which extracts
the exact position, size and rotation of the face in the image. Secondly, the model
enumerates a reduced set of components which unequivocally determine the facial
gesture. This second stage comprises the location and description of these
components. Finally, the classification process uses shape descriptors to make its final
decision about the facial expression.

The result is a modular and generic process definition. The modules pursue clearly
differentiated, non-overlapping and generic objectives which span a wide range
of research areas rather than a reduced set of techniques. The prototype
implementation allowed us to approach the problem from a more practical point of view
and to experiment with the designed process. The prototype is a fulfilment of this
architecture. Implementation decisions were based on the a priori defined criteria for a
good characterisation of the input images: component location and shape description.

The final percentage of successfully classified examples is over 75%. This figure
undoubtedly reflects the degree of simplicity/complexity of the set of test images used.
Nevertheless, the fact that the system achieves very similar results for both sets of input
images, those taken under the same illumination conditions and those taken under different
ones, is good evidence of the viability of the implemented scheme.

Abstract. The hypothesis is that in the lowest hidden layers of biological systems
"local subnetworks" are smoothing an input signal. The smoothing accuracy may serve
as a feature to feed the subsequent layers of the pattern classification network. The
present paper suggests a multistage supervised and “unsupervised” training approach
for the design and training of multilayer feed-forward networks. Following the
methodology used in statistical pattern recognition systems, we functionally split
the decision making process into two stages. In the initial stage, we smooth the input
signal in a number of different ways and, in the second stage, we use the smoothing
accuracy as a new feature to perform the final classification.



1 Introduction 

A large number of hidden layers is the essential feature of modern pattern
classification systems based on neural networks [4, 8, 10]. The most popular technique for
training a multilayer perceptron (MLP) classifier is the gradient descent based back
propagation (BP) algorithm. However, the algorithm suffers from the long learning times
required to obtain a solution. More complex algorithms, such as conjugate gradient or
quasi-Newton methods, are faster but more sensitive to local minima
problems [9]. Two typical operations performed in each MLP neuron are a weighted
summation and a nonlinear transformation. While training an MLP the error signal
propagates from the output layer back to the hidden layers and is used to update the
weights of the network. A change in the weights is proportional to the error signal and
to the gradient of the cost function calculated over a number of neurones in the upper
layers of the network. With an increase in the number of iterations the MLP weights
increase and lead to diminution of the gradients. Thus, irrespective of the learning set
(empirical) error, the training process slows down. This problem is especially
noticeable in feed-forward networks with a high number of hidden layers. In
complex real world pattern classification problems, however, a high number of
hidden layers is a typical situation. As a result, the number of hidden layers and the
non-linear character of the activation function constitute the main reasons for the slow
training speed.

Propagation of the error signal through a great number of layers is not probable while
training biological neural networks. We suppose that the weight change in such
systems is performed by a simpler mechanism, and hypothesize that in lower hidden
layers of biological systems “local subnetworks” are smoothing input signals. The
smoothing accuracy may serve as a feature to feed the subsequent layers. In this paper
we suggest a multistage supervised and "unsupervised" training approach for the design
and training of complex multilayer feed-forward networks. We smooth the input
signals in a number of different ways and use the smoothing accuracy as a new feature
supplied to the next layer of the network. This approach has much in common with
traditional techniques in pattern recognition, where input layers are utilized for feature
extraction.

Our argumentation, however, is rather different: we do not use artificial
mathematical methods (spectral or cepstral features, coefficients of autoregression or
moving average models, one or two dimensional Gabor or Fourier filters, see
e.g. [5]) for the feature extraction. Instead, we suggest training the input layers of the
network to smooth the signal and to calculate the smoothing accuracy. We use
selected typical signals to train the hidden layer subnetworks for signal prediction.
Later in the course of the algorithm the prediction accuracy is used to form the new
features. The number of selected typical signals determines the number of features
utilized in the final recognition stage of the decision making network. In order to
explain the main idea, below we present the application of our multistage supervised
and unsupervised training algorithm to a problem of classification of
electrocardiographic signals. To make our explanation as simple as possible we analyze
a relatively simple network, consisting of several hidden layers and one output layer.
The first hidden layer performs signal smoothing. This layer is not trained; it is
defined a priori. The upper hidden layers are trained to predict the signal from
several of its adjacent values and to estimate the prediction accuracy. The output layer
performs the final classification and is trained by the conventional BP technique.



2 ECG Classification Problem 

The ECG classification is vital in determining the susceptibility of an individual to
sudden cardiac death. Until the last decade, analysis of an electrocardiogram (ECG)
was grounded on examination of low frequency information (P, Q, R, S, T
waveforms; position and length of some intermediate segments). Higher frequency
information, which usually carries less energy, was excluded from analysis because of
the relatively low accuracy of available recordings. The increase in accuracy of
recordings, as well as in computer power, has encouraged the search for information on
the state of the heart among the higher frequency oscillations (e.g. the presence of late
potentials in the ST segment has been discovered and their prognostic value proven [2]).
The current study is focused on examination of the high frequency oscillations found
on a T wave, which by itself has attracted enormous interest of cardiologists during
the last few years. The analysis of the high frequency component is made possible by the
high accuracy of the available ECG recordings (12 bits, 2 kHz discretization). Intrinsic
properties of the high frequency oscillations are considered to be represented by
autoregression models. Neural networks are supposed to discover informative
components, which normally are mixed with noise arising during the ECG recording.

Fig. 1. The training schema of the multilayer network for signal classification

The two classes to be delineated by the proposed method are: myocardial
infarction (MI) patients who have had the complication of ventricular fibrillation (VF),
as opposed to those whose MI was not complicated by VF. The task of
assessing the risk of life threatening ventricular fibrillation is important, but no
reliable solutions have been achieved as yet [11].



3 The ANN Training Strategy and the Architecture of the Network



The proposed network consists of the hidden layer and the output layer. The 
sequence of layers is selected to resemble feature extraction procedures performed in 
living systems. The specific characteristic of the network is its step by step training 
algorithm introduced instead of the overall BP training procedure. 



The hidden layer. Here, in the first step, smoothing of the original signal
$X=\{x_1, x_2, \ldots, x_n\}$ is performed:

$$x_j^{smoothed} = \sum_{i=-R}^{R} v_i x_{j-i}, \qquad (1)$$

where $v_{-R}, \ldots, v_R$ are values of a smoothing window, such that
$\sum_{i=-R}^{R} v_i = 1$, and $R$ represents the window width; $j=1,2,\ldots,n$. The values of the
parameters $v=\{v_i\}$, which form the weights of the first network layer, are chosen
task-specifically; no training is done.

In the second step, the high frequency signal $\Delta X=\{\Delta x_1, \Delta x_2, \ldots, \Delta x_n\}$ is extracted
from the original signal, using the smoothed one:

$$\Delta x_j = x_j - x_j^{smoothed}, \quad j = 1, 2, \ldots, n. \qquad (2)$$

No training is required in this step.

In the third step, a set $S = \{S_1, \ldots, S_p\}$ of “typical” high frequency signals
from both pattern classes is selected, and the corresponding set of linear neural
networks is trained for the signal prediction:

$$\Delta x_j^{predicted,k} = \sum_{i=-N,\ i \ne 0}^{M} w_i^k \, \Delta x_{i+j}, \qquad (3)$$

where $w_i^k$ is the $i$-th weight of the $k$-th prediction rule, $j=1,2,\ldots,n$, $k=1,2,\ldots,p$.
Parameters $w=\{w_i^k\}$ form the weights of the second hidden layer. In the classification
phase, $p$ units perform prediction of an input sequence. As a result, we have $p$
sequences $\Delta x_j^{predicted,k}$, $j =1,2,\ldots,n-N-M$, $k = 1, \ldots, p$. Together, all three steps
perform a weighted summation.

In the fourth step, “similarities” $y_1, y_2, \ldots, y_p$ of a sequence to be classified
$\Delta X=\{\Delta x_1, \Delta x_2, \ldots, \Delta x_n\}$ to the typical signals $S_1, S_2, \ldots, S_p$ are calculated:

$$y_k = f(\Delta x_j - \Delta x_j^{predicted,k},\ j = 1,2,\ldots,n,\ z), \qquad (4)$$

where $\Delta x_j^{predicted,k}$ denotes $\Delta x_j$ predicted using the weight set $w_i^k$, $k=1,2,\ldots,p$,
$i = -N, -N+1, \ldots, -1, 1, 2, \ldots, M$; $z$ denotes a parameter set for the function $f$. The
function $f$ includes averaging of the differences $\Delta x_j - \Delta x_j^{predicted,k}$ in partially intersecting
subintervals of the original signal and finding the median value of these averages.
Parameters $z$ control the averaging process.

Fig. 2. The original T wave (solid curve) and the smoothed signal (dotted curve) by means of
triangular window with a basis (-20 20).

The information processing schema of all four steps is summarized in Fig. 1. 
Actually all four steps perform a weighted summation of the input signals and a 
nonlinear operation. In this sense, in a classification phase, an information processing 
procedure of all four steps can be realized by a non-linear single layer perceptron 
(SLP). The non-linear SLP also performs a weighted summation of input signals and a 
nonlinear operation [6] . 

The output layer is trained by the conventional BP algorithm to assign the set of
values $y_1, y_2, \ldots, y_p$ to the specific class. The entire recognition algorithm is shown
in Fig. 1. The information processing schema corresponds to that of a one hidden layer
MLP.



4 Application of the Algorithm to ECG Analysis 

The latest achievements in ECG analysis are dominated by features which are said
not to be visible to the naked eye. This concerns the high frequency
components mentioned earlier [16] and dynamical patterns arising in the long run of an ECG [3]. In




Til S. Raudys and M. Tamosiunaite 



analysis of the class of ECGs with ventricular fibrillation as opposed to those without
VF, the high frequencies of the cardiosignals are assumed to contain a significant part of the
information that should be used for classification purposes. Therefore, it is worth
analyzing the difference between the original and the smoothed signals,
$\Delta x_j = x_j - x_j^{smoothed}$, as given in the algorithm above.

In our ECG classification problem, the weights of the first layer of the network
(eq. 1) were fixed a priori: we used a triangular window with
$v_j = R - |j| + 1$, $j = 0, \pm 1, \pm 2, \ldots, \pm R$. An example of the original and the smoothed signal of one
particular T wave is presented in Fig. 2.
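
The first two steps (eqs. 1 and 2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the triangular weights are rescaled so that they sum to one as eq. (1) requires, the signal ends are handled by zero padding, and the test signal is synthetic rather than an ECG recording.

```python
import numpy as np

def triangular_window(R):
    """Weights proportional to R - |j| + 1 for j = -R..R, rescaled to sum to 1."""
    j = np.arange(-R, R + 1)
    v = (R - np.abs(j) + 1).astype(float)
    return v / v.sum()

def smooth_and_residual(x, R):
    """Return the smoothed signal (eq. 1) and the high frequency residual (eq. 2)."""
    v = triangular_window(R)
    # mode="same" keeps the signal length; samples closer than R to either end
    # are affected by implicit zero padding (an assumption of this sketch).
    smoothed = np.convolve(x, v, mode="same")
    return smoothed, x - smoothed

# Synthetic stand-in for a T wave: a slow bump plus a small fast oscillation.
t = np.linspace(0.0, 1.0, 400)
x = np.sin(np.pi * t) + 0.01 * np.sin(2 * np.pi * 60 * t)
smoothed, dx = smooth_and_residual(x, R=20)
```

Away from the ends, `dx` retains mostly the fast oscillation, which is the component passed on to the prediction layer.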



To extract the useful information for classification, the network should calculate
similarities $y_1, y_2, \ldots, y_p$ of the sequence $\Delta X$ to be classified to $p$ typical T waves
selected from both classes. For this purpose, in the training process of the overall
neural network we design $p$ prediction rules, one for each of the selected T waves.

In this particular pattern recognition task we have chosen to use a simple linear
prediction rule (3). An example of the difference signal $\Delta X$ (solid curve) and the
predicted signal $\Delta X^{predicted}$ (the other curve) is presented in Fig. 3. The weights used
for prediction are derived not from the signal itself, but from another T wave,
selected as typical during the training procedure. It is evident that the difference
between these two curves in the given example is minute, prompting the suggestion
that the given signal could belong to the same class as the selected prototype.




Fig. 3. The difference between the smoothed signal and its prediction by means of a linear
equation = -0.2350 + 0.6849 j + 0.8204 -0.2663 -1-1.7139.

In the classification phase, for a number (say $J$) of partially intersecting
subintervals of the T wave we calculated $J$ averages of absolute values of the deviations
$|\Delta x_i - \Delta x_i^{predicted,k}|$, $i = 1, 2, \ldots, L$, where $L$ is the length of the subinterval. Let us call
the average over the $r$-th subinterval $y_k^r$. Mean values $y_k$ of the average prediction errors
$y_k^r$ evaluated over all $J$ subintervals of the T wave of the ECG signal were used as the new
features. Above, index $k$ denotes the current number of the typical signal, $k = 1, 2, \ldots, p$.
In order to obtain robust estimates we utilized a sigmoid function to reduce the contribution
of the largest ($\approx 10\%$) deviations. A sequence of new parameters $y_1, y_2, \ldots, y_p$ was used to
make a final classification rule by means of a conventional single layer perceptron.
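
The prediction and similarity steps can be sketched as follows. The sketch makes several assumptions that are not in the paper: the prediction weights are obtained by ordinary least squares (`numpy.linalg.lstsq`) instead of training a linear perceptron, the subinterval length `L` and shift `step` are arbitrary illustrative values, the sigmoid-based robustification is omitted, and all function names are hypothetical.

```python
import numpy as np

def neighbour_matrix(dx, N, M):
    """Stack the N preceding and M following neighbours of every predictable sample."""
    idx = np.arange(N, len(dx) - M)
    offsets = [o for o in range(-N, M + 1) if o != 0]
    A = np.stack([dx[idx + o] for o in offsets], axis=1)
    return A, dx[idx]

def fit_predictor(typical_dx, N=2, M=2):
    """Weights of one prediction rule (eq. 3), fitted on one typical signal."""
    A, target = neighbour_matrix(typical_dx, N, M)
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w

def similarity_feature(dx, w, N=2, M=2, L=50, step=25):
    """Mean over overlapping subintervals of the average absolute prediction error."""
    A, target = neighbour_matrix(dx, N, M)
    abs_err = np.abs(target - A @ w)
    starts = range(0, len(abs_err) - L + 1, step)
    averages = [abs_err[s:s + L].mean() for s in starts]
    return float(np.mean(averages))

# One feature per typical signal; a signal resembling the typical one scores low.
k = np.arange(300)
typical = np.sin(0.3 * k)
w = fit_predictor(typical)
feature_same = similarity_feature(typical, w)
feature_noise = similarity_feature(np.random.default_rng(0).normal(size=300), w)
```

With $p$ typical signals, calling `similarity_feature` once per fitted weight vector yields the $p$-variate feature vector passed to the output layer.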



5 Experiments and Results 

To verify the compound network’s architecture and the weight estimation schema,
we used a two-category data set composed of T waves obtained from 59 myocardial
infarction (MI) patients who did not have the complication of ventricular fibrillation,
and 43 who had suffered this problem. For training we used 29+22 patients and
for testing the remaining 29+21 patients. After a couple of experiments with $R=20$ and
$R=60$ for the signal smoothing we selected $R=20$. After the signal smoothing and
finding the differences $\Delta x_j = x_j - x_j^{smoothed}$, in order to find the weights, for each of the
51 individual T waves we trained the linear single layer perceptron (3) to predict a
“middle value” from $N+M$ values of the signal, and measured the prediction accuracy
evaluated on this particular T wave of the ECG signal. The 23 T waves with the highest
prediction accuracy were selected as “typical” ($p = 23$). The threshold for selection
was:

$$\text{prediction error} < 0.13\, s(\Delta x_j), \qquad (5)$$

where $s(\Delta x_j)$ is a sample standard deviation of the absolute values $|\Delta x_j|$,
$j=1,2,\ldots,n$. Then the particular set of weights $w_{-N}^k, \ldots, w_{-1}^k, w_1^k, \ldots, w_M^k$
of prediction equation (3) was selected as the parameters of the neurons of the
second stage of the network. In our pattern classification problem, we tested $M =
N = 1, 2, 3, 4$ and $5$ and found that $M = 2$ was enough to obtain sufficiently high prediction
accuracy. We selected 14 “similarity” T waves from the first pattern class and 9 from
the other one. Thus, our network contained 23 hidden units that calculated
“similarities”, the new secondary features of the signal.

As the decision making schema in the output layer, we selected a simple single
layer perceptron (SLP) trained by a back propagation algorithm. For training we used
5 T segments from each of the 29+22 patients investigated. For testing we also used 5 T
segments from each of the remaining 29+21 patients. Thus, for training we used 255
23-variate vectors, and for testing we had 250 vectors.

To speed up the training process and to improve generalization we began the
training from a zero initial weight vector of the perceptron, moved the mean of
the training data to the center of coordinates, and rotated and scaled the data as
suggested in [14,15]. In order to obtain the minimal classification error, we utilized an
“antiregularization” technique described in [13]. To increase the magnitudes of the
perceptron’s weights and to force the cost function to minimize the empirical frequency
of misclassifications we added to the standard sum of squares cost function a positive
term $+0.1 \times (V^T V - 100)^2$, with $V$ the perceptron's weight vector, and stopped training
optimally (after 4100 batch iterations). $P_{generalization} \approx P_{test} \approx 0.304$ was achieved.
A classification matrix of the test set is presented in Table 1 (left part).

Table 1. The classification matrices of the test set (left - without voting, right - with voting) 



Without voting                        With voting
Pattern class         1      2        Pattern class         1      2
1 (without VF)      127     18        1 (without VF)       28      1
2 (with VF)          58     47        2 (with VF)          10     11


As noted before, in our investigation each patient was characterized by 5
ECG T segments. Therefore, further improvement in the accuracy was obtained by a
voting procedure: the generalization error was reduced to 22% (Table 1, right).
In comparison with conventional MLP training, the achieved accuracy is much higher:
earlier, while classifying individual ECG T segments we obtained 37-39% of errors,
just slightly better than a randomized classification according to the class prior
probabilities, $P_1 = 0.58$ and $P_2 = 0.42$. In general, the achieved level of accuracy in
sudden death recognition is reasonably high according to the standards of present-day
medicine.
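
The error rates quoted above follow directly from Table 1, as the short check below shows. The majority-vote function is only an assumed form of the voting procedure: the text says that each patient has 5 segment decisions, but does not spell out the rule, so a simple majority is used here for illustration.

```python
import numpy as np

# Classification matrices from Table 1 (class 1: without VF, class 2: with VF).
without_voting = np.array([[127, 18], [58, 47]])
with_voting = np.array([[28, 1], [10, 11]])

def error_rate(confusion):
    """Fraction of off-diagonal entries of a classification matrix."""
    return 1.0 - np.trace(confusion) / confusion.sum()

def vote(segment_labels):
    """Majority class (1 or 2) among the segment decisions of one patient."""
    labels = np.asarray(segment_labels)
    return 1 if np.sum(labels == 1) > labels.size / 2 else 2

segment_error = error_rate(without_voting)   # 76 / 250, about 0.304
patient_error = error_rate(with_voting)      # 11 / 50, about 0.22
class_priors = without_voting.sum(axis=1) / without_voting.sum()  # about [0.58, 0.42]
```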



6 Concluding Remarks 

Above we presented a simple example that illustrates our approach to the training of
complex feedforward neural networks. Instead of training all layers of the network by
the conventional BP technique we try to simplify the training process by
offloading a part of the work to the deterministic selection of weights and to
unsupervised training. In the first layers of the proposed neural network, values of neighboring
components of input vectors are predicted, while in the later stage, the prediction accuracy
is evaluated and the final decision is made.

In the feature definition phase, we do not use the class membership indexes.
Instead, we use the neighboring information from the process under investigation in
order to determine the weights. Averaged errors of prediction of neighboring cells
serve as features, supplied to the higher layers of the network. Some neurobiological
evidence exists that this type of information processing is characteristic of “natural
information processing systems” [12]. Moreover, the input decorrelation
technique utilized while training the output layer of the network agrees with the results
of investigations of the visual cortex in biological systems. In an analysis of retinal
ganglion cells, it was proposed that the purpose of retinal processing is to improve the
efficiency of visual representation by recoding the input into a spatially decorrelated
form [1]. A bit later the decorrelation dynamics of lateral interaction between
orientation selective cells in the early visual cortex was shown to yield quantitative
predictions about orientation contrast, adaptation, and development of orientation
selectivity which are in good agreement with experiments [6, 7].

Our analysis of the complex cardiology problem has shown that this simple
method achieves rather high classification accuracy on the basis of only the ECG
signal’s T wave measured for each patient. The achieved accuracy is considered high
in the field of ECG analysis for sudden death prediction.

Our main goal was to illustrate that a very simple network can be used to solve
a rather complex real world problem. For this we need to decompose a global task into
separate steps and perform training for each step separately. The solutions obtained
are more stable, and they do not require assumptions about mathematical models
of the data. While training we have a milder local minima problem, and the number of
weights to be learned is much smaller than while training the MLP classifier. Thus,
from the theoretical point of view, such a training strategy should require fewer training
samples and should lead to better generalization properties.

Furthermore, the architecture of the network can be expanded. Instead of the
simple triangular smoothing one can use a more complex window function. Instead of
the plain linear prediction in the feature definition stage, one can use complex MLP or
Radial Basis Function (RBF) networks. For final decision making one can also use
MLP or RBF networks. The approach should be adapted specifically to each real
world problem in question. It has been successfully used for identification of EEG
states and financial time series prediction. In the latter case, instead of using SLP in
the input and output layers we were obliged to use MLP. The MLP classifiers in
the input layers were trained using different subsets of training data selected by a special
neural network based algorithm and a financial analyst.

Abstract. A technique is proposed for choosing the thresholds for a number of
object detection tasks, based on a prototype selection technique. The chosen
prototype subset has to be correctly classified. The positive and negative objects
are introduced in order to provide the optimization via empirical risk minimization.
A Boolean function and its derivatives are obtained for each object. A special
technique, based on the fastest gradient descent, is proposed for the maximization of
the sum of Boolean functions. The method is applied to the task of detecting
house edges in aerial photos. It is shown that the proposed method
can be extended to a wide range of tasks connected with function
optimization, where the function is given in the vertices of a $2^n$ single hyper-cube.



Keywords: Prototype selection. Sum of Boolean function optimization. Edge
detection.



1 Introduction 

A wide variety of image processing tasks such as image compression [17], image
matching for motion estimation and 3D modeling [2], object decipherment [14], edge
detection [4] and so on, demand in their final stage a threshold selection procedure to
distinguish between correct and incorrect results. Usually these thresholds are supposed
to be pre-defined and no techniques are given for their correct choice, in spite of the
strong influence of these thresholds on the final results. It was shown in previous
works [14] that correct parameter estimation can be achieved using two sets of samples:
a set of positive samples (objects) $E = \{e_1, e_2, \ldots, e_m\}$ and a set of negative samples
(anti-objects) $A = \{a_1, a_2, \ldots, a_p\}$. Sometimes [17] an automatic procedure for object
and anti-object detection can be found; however, in the common case samples are
pointed out manually. Every object $e_i$, $a_j$, $i = 1,\ldots,m$, $j = 1,\ldots,p$ can be represented by an
$n$-dimensional vector of features $(\varphi_1, \varphi_2, \ldots, \varphi_n)$, where $\varphi_j \in R$ is supposed to be a real
value from a limited interval $[\varphi_j^{min}, \varphi_j^{max}]$. Without loss of generality, let us assume that
for each $j$, $E[\varphi_j(e)] > E[\varphi_j(a)]$, where $E[\varphi_j(E)]$ and $E[\varphi_j(A)]$ are average values
of the feature $\varphi_j$ distribution for samples of an object and an anti-object. Under this
assumption thresholds $t_j = \min_i \varphi_j(e_i)$, $j = 1,\ldots,n$ provide correct classification for all
objects and misclassification for anti-objects with values $\varphi_j(a) \ge t_j$. In other
words, object $e_i = (X_{i1}, \ldots, X_{in})$, where $X_{ij}$ is the value of feature $\varphi_j$, $j = 1,\ldots,n$, for
object $e_i$, will be correctly classified if for each feature $\varphi_j$ the value $X_{ij}$ exceeds or
equals the threshold $t_j$, i.e., $\forall j\,[X_{ij} \ge t_j]$. To simplify the expression let us introduce
a function similar to the Kronecker notation:

$$\delta(X, t) = 1 \text{ if } X \ge t \text{ and } 0 \text{ otherwise}. \qquad (1)$$

Using (1) the expression for true object classification can be rewritten as

$$\prod_{j=1}^{n} \delta(X_{ij}, t_j). \qquad (2)$$

(2) becomes one for true classification and zero for misclassification.

The problem is to define optimal thresholds $t_j$, $j = 1,\ldots,n$ in the sense of empirical
risk minimization [20], i.e., to minimize the functional

$$\sum_{i=1}^{m}\Bigl[1-\prod_{j=1}^{n}\delta(X_j(e_i),t_j)\Bigr]+\sum_{i=1}^{p}\prod_{j=1}^{n}\delta(X_j(a_i),t_j), \qquad (3)$$

where $X_j(e_i)$ and $X_j(a_i)$ are values of feature $\varphi_j$ for an object and an anti-object. The
first sum of (3) expresses the misclassification rate and it changes at the points
$t_j = X_j(e_i)$. When $t_j$ is increased the sum grows, and vice versa. The second
sum of (3) expresses the false alarm rate. It changes at the points $t_j = X_j(a_i)$ and it
decreases when $t_j$ is increased, so if there exists a situation
$X_j(e_{i_1}) < X_j(a_{i_1}) < X_j(a_{i_2}) < \ldots < X_j(a_{i_k}) < X_j(e_{i_2})$ there is no reason to choose any
one of $X_j(a_{i_l})$ in order to minimize (3). Thus, it is enough to consider only the values
$X_j(e_i)$. If for any feature $\varphi_j$ a value $t_j > X_j(e_i)$ is chosen, $e_i$ will not be correctly
classified. Hence, the following approach to threshold selection can be used. A subset
$E' \subset E$ of objects has to be selected that should be correctly classified using the
thresholds

$$t_j = \min_{e_i \in E'} X_j(e_i), \quad j = 1,\ldots,n \qquad (4)$$

and it has to minimize sum (3). Using this definition the task of threshold selection
can be considered as a task of prototype selection (PS).
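
The link between a prototype subset, the thresholds (4) and the empirical risk (3) can be illustrated with a small sketch. The feature values below are invented for illustration only, the function names are hypothetical, and the acceptance rule uses the non-strict comparison "exceeds or equals the threshold" stated in the text.

```python
import numpy as np

def thresholds(objects, subset):
    """Eq. (4): per-feature minimum over the selected prototypes."""
    return objects[subset].min(axis=0)

def empirical_risk(objects, anti_objects, t):
    """Eq. (3): number of missed objects plus number of accepted anti-objects."""
    accepted_obj = np.all(objects >= t, axis=1)
    accepted_anti = np.all(anti_objects >= t, axis=1)
    return int(np.sum(~accepted_obj) + np.sum(accepted_anti))

# Three objects and three anti-objects described by n = 2 features (made-up values).
objects = np.array([[0.9, 0.8], [0.7, 0.9], [0.2, 0.3]])
anti = np.array([[0.3, 0.4], [0.25, 0.35], [0.5, 0.1]])

all_selected = thresholds(objects, [0, 1, 2])    # every object must be accepted
risk_all = empirical_risk(objects, anti, all_selected)

two_selected = thresholds(objects, [0, 1])       # the outlying third object is dropped
risk_two = empirical_risk(objects, anti, two_selected)
```

In this toy case dropping the outlying third object raises both thresholds, which costs one missed object but removes two false alarms, so the smaller subset gives the lower risk.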

The PS task has a rather long history. In the works of P. E. Hart [9] and G. W.
Gates [8] the PS task was defined as the task of selecting the most representative
prototypes for k-NN classification. To overcome the problems of classification quality
a hypothesis of local statistical compactness was proposed [6]. There are a number of
algorithms in this framework [11], [3], [13]. Some of them, e.g. [13], allow combination
of features (genes) to generate new prototypes.

Other approaches can be borrowed from the feature selection task. In expression
(3) positive samples $e_i$ can be treated as a feature set and negative samples can be
used for feature selection. All approaches in the framework of feature selection can
be divided into optimal and sub-optimal approaches. The optimal approach involves
exhaustive search [10] and the mathematical programming approach [7]. The exhaustive
search of all available prototype combinations demands $2^m L$ recognition operations,
where $m$ is the initial number of prototypes and $L$ is the test sample volume. Evidently
it is impractical in the case of (3) minimization. The mathematical programming
approach, which demands $\alpha 2^m L$ operations, where $\alpha \in [0.008 \ldots 0.04]$, is also not
applicable for (3) optimization, since $L = m + p$ can be very large.

A sub-optimal approach supposes the existence of a finite set of optimal vertices in
an $m$-dimensional hyper-cube, where each vertex corresponds to a combination of the
prototypes. Let us assume that a binary vector $b = \{b_1, \ldots, b_m\}$ corresponds to the initial
prototype set $E$. If $b_i = 1$, prototype $e_i$ is included into the optimal subset $E'$. Using
random generation of vector $b$ and its substitution into (3) one can choose an optimal
$E'$, related to the minimum of (3). If there are $N_{opt}$ optimal subsets $E'$ with the same
value of (3), the probability to reach one of them is $p = N_{opt}/2^m$ and hence [15]

$$T = -2^m \ln(1-\eta)/N_{opt} \qquad (5)$$

runs of a search algorithm are enough to reach a global minimum with probability $\eta$. There are
two main directions in the framework of the sub-optimal approach, yielding a single
solution of the problem. They are the deterministic and stochastic approaches [10]. The
stochastic approach [1] is based on a punishment and encouragement technique for
different subsets. It is similar to the genetic approach [13] and actually converges to an
adaptive random search. The family of deterministic algorithms [12] can be represented
for (3) optimization by forward and backward methods, which are based on
sequential deletion/addition of an appropriate prototype.
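
A short numerical reading of estimate (5) is given below. The sketch assumes the form $T = -2^m \ln(1-\eta)/N_{opt}$ with $N_{opt}$ the number of optimal subsets; the function name and the example values ($m = 10$, one optimal subset, $\eta = 0.95$) are illustrative only.

```python
import math

def runs_needed(m, n_opt, eta):
    """Number of random draws suggested by eq. (5), rounded up."""
    return math.ceil(-(2 ** m) * math.log(1.0 - eta) / n_opt)

m, n_opt, eta = 10, 1, 0.95
T = runs_needed(m, n_opt, eta)

# Probability that T independent uniform draws all miss the optimal vertices.
p_hit = n_opt / 2 ** m
p_miss_all = (1.0 - p_hit) ** T
```

The number of runs grows as $2^m$, which is why pure random search becomes impractical for large prototype sets.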

Sometimes, as for 1-NN classification [15], functional (3) can be expressed analytically
with respect to the probabilities of prototypes being present in the optimal subset, and the
optimal subset can be found using classical optimization procedures [3]. The purpose
of this work is to show that this approach is also possible for (3) minimization, so that
one can derive an algorithm that combines the advantages of the above mentioned
approaches.

In Section 2 an illustrative example for n = 1 is given. In Section 3 an algorithm
for the common n-dimensional case is developed. Section 5 includes experiments and
conclusions.

2. One-Dimensional Case

Let us consider an illustrative example of threshold selection. Let $n = 1$ be the number
of features, $E = \{e_1, \ldots, e_m\}$ an initial prototype set, where prototypes are sorted
so that $\varphi(e_m) > \varphi(e_{m-1}) > \ldots > \varphi(e_1)$, and $A = \{a_1, \ldots, a_p\}$ a set of negative objects
which are also sorted. If threshold $t$ is chosen using (4), $e_1$ is recognized correctly
if and only if $e_1 \in E'$. Prototype $e_2$ can be classified correctly if $e_1$ or $e_2$ are included
into $E'$, since $\varphi(e_1) < \varphi(e_2)$, and so on: $e_i$ can be recognized if one of $e_j$, $j \le i$, is
included into $E'$. Using a set of Boolean variables $b = \{b_1, \ldots, b_m\}$, where each $b_i$ is
associated with $e_i$, the condition for correct classification of $e_i$ can be written as
$f(e_i) = b_1 \vee b_2 \vee \ldots \vee b_i = \bigvee_{j=1}^{i} b_j$, and for a negative object
$f(a_i) = \bigvee_{\varphi(e_j)<\varphi(a_i)} b_j$. Since the
aim is to recognize $e_i$ and not to recognize $a_i$, functional (3) can be rewritten as

$$\psi(b) = \sum_{i=1}^{m}\Bigl[\bigvee_{j=1}^{i} b_j = \text{true}\Bigr] + \sum_{i=1}^{p}\Bigl(1 - \Bigl[\bigvee_{\varphi(e_j)<\varphi(a_i)} b_j = \text{true}\Bigr]\Bigr). \qquad (6)$$



Let us assume that $0 \le p_i \le 1$ is a fuzzy variable [19] associated with $e_i$. If a lot of
subsets $E'$ are obtained using random generation, then $p_i$ means the probability that
prototype $e_i$ belongs to the optimal prototype subset. Hence, $q_i = 1 - p_i$ means a
fuzzy variable associated with $\bar{e}_i$. Since prototypes are supposed to be independent,
(6) can be rewritten using $q_i$ as follows:

$$\psi(q) = \sum_{i=1}^{m}\Bigl(1 - \prod_{j=1}^{i} q_j\Bigr) + \sum_{i=1}^{p}\ \prod_{\varphi(e_j)<\varphi(a_i)} q_j. \qquad (7)$$

The problem is to find the maximum of the analytical function $\psi(q)$, where
$q = \{q_1, \ldots, q_m\}$. There are some properties of (7) that make the problem easier. It
follows from (7) that every $q_i$ is represented in each product not more than
once. Therefore for every $q_i$

$$\frac{\partial^2 \psi}{\partial q_i^2} = 0. \qquad (8)$$

Such a function can be called a poly-line function [15]. Generally speaking, there is no
problem in calculating derivatives of the poly-line function. For instance, in case (7) the
expression for the derivative is

$$\frac{\partial \psi}{\partial q_k} = -\sum_{i=k}^{m}\ \prod_{j=1}^{k-1} q_j \prod_{j=k+1}^{i} q_j + \sum_{\varphi(a_i)>\varphi(e_k)}\ \prod_{\varphi(e_j)<\varphi(a_i);\ j\ne k} q_j. \qquad (9)$$

So calculation of the gradient demands almost no additional operations and optimization
based on the fastest gradient descent [5] can be used. From (9) it follows that
$-m \le \partial\psi/\partial q_k \le p$, which fulfils the Lipschitz condition that is necessary for the
existence of the maximum of (7). Let us prove that a poly-line function can be represented as a
convex combination of its values in the vertices of the $2^m$ single hyper-cube as follows:

$$\psi(q_1, \ldots, q_m) = \sum_{i=0}^{2^m-1} L_i\, \psi(B_m(i)), \qquad (10)$$

where $L_i = \prod_{j=1}^{m} y_j(i)$ and $B_m(i)$ is the binary $m$-dimensional
decomposition of the integer value $i$, where $i$ is one of the vertices and each bit
$j = 1,\ldots,m$ of the decomposition corresponds to the associated variable $q_j$. Variable
$y_j(i) = q_j$ if bit $j$ in $B_m(i)$ is equal to one and $y_j(i) = 1 - q_j$ otherwise.

Let us consider the following induction. If m = 1 poly-line function (10) can be rep- 
resented as v|/(q) = aq + b or using decomposition LQ=l-q,Lj=qso 
Lg+L[ =land v|/(q)= (a + b)q + b(l-q) . Let us assume that it is correct for all 
m = 1,2,.., d and let us try to prove that it is correct also for m = d + 1 . From (8) it 
follows that 

v(qi>->qd>qd+i)=Vi(qi,-:qd)qd+i+V2(qi--qd)> 

where vi/i ,\|/2 are poly-line functions that can be represented in form (10). Substitu- 
tion (10) into (11) gives 




Thus, $\psi(q_1,\ldots,q_d,1) = \psi_1 + \psi_2$, $\psi(q_1,\ldots,q_d,0) = \psi_2$ and

$$\sum_{i=0}^{2^d-1} L_i\,q_{d+1} + \sum_{i=0}^{2^d-1} L_i\,(1 - q_{d+1}) = \sum_{i=0}^{2^d-1} L_i = 1.$$

Therefore all values in the hyper-cube are bounded by the values at its vertices, so a
global maximum of (10) is attained at a vertex of the hyper-cube; hence it is enough for
the optimization algorithm to consider only the values $q_i \in \{0,1\}$.
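
The convex-combination property (10) can be checked numerically. The following is a minimal Python sketch, assuming a small hand-picked poly-line function `psi` (hypothetical, chosen only for illustration); it rebuilds the value at an interior point of the hyper-cube from the values at the vertices.

```python
from itertools import product

def psi(q):
    # Hypothetical poly-line (multilinear) function: every variable
    # appears at most once in each product.
    return 3 * q[0] * q[1] - 2 * q[0] - q[1] + q[2]

def from_vertices(f, q):
    """Evaluate formula (10): sum over vertices of f(vertex) * prod_j y_j."""
    total = 0.0
    for bits in product((0, 1), repeat=len(q)):
        weight = 1.0
        for b, qj in zip(bits, q):
            weight *= qj if b == 1 else 1.0 - qj  # y_j = q_j or 1 - q_j
        total += weight * f(bits)
    return total

point = (0.25, 0.5, 0.75)
direct = psi(point)                    # value computed from the formula
rebuilt = from_vertices(psi, point)    # value rebuilt from the 8 vertices
vertex_max = max(psi(v) for v in product((0, 1), repeat=3))
```

Because the weights are non-negative and sum to one, `rebuilt` can never exceed `vertex_max`, which is the argument used above for restricting the search to vertices.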

The main corollary of this property is that a variety of tasks connected with
optimization on the $2^m$ vertices of the hyper-cube, where the value of the functional
can be found at every vertex, can be solved using poly-line function optimization.
However, this corollary does not show how the poly-line function and its derivatives
can be found, except in the trivial case (10) when all $2^m$ values are known. In most
pattern recognition tasks, functional (3) can be represented as a sum of Boolean
functions as in (6). Let us prove that if a Boolean function $F(x_1,\ldots,x_n)$ is
defined on all $2^n$ sets of variables $x_i \in \{0,1\}$, there exists a poly-line
function $\psi(p_1,\ldots,p_n)$, where $p_i$ is the probability that $x_i$ belongs to
the optimal subset.

Indeed, every Boolean function $F(x_1,\ldots,x_n)$ can be represented using its
disjunctive normal form $F(x_1,\ldots,x_n) = x_i A \vee \bar{x}_i B \vee C$, where $A$,
$B$ and $C$ do not depend on $x_i$. The expectation $\psi(p_1,\ldots,p_n)$ of
$F(x_1,\ldots,x_n)$ is

$$\psi(p_1,\ldots,p_n) = 1 - (1 - \Pr(C))\,(1 - \Pr(x_i A \vee \bar{x}_i B \mid \bar{C})). \qquad (12)$$

Since the events $x_i A$ and $\bar{x}_i B$ are independent, (12) can be rewritten as

$$\psi(p_1,\ldots,p_n) = 1 - \Pr(\bar{C})\,\{1 - p_i[\Pr(A \mid \bar{C}) - \Pr(B \mid \bar{C})] - \Pr(B \mid \bar{C})\}, \qquad (13)$$

and hence $\partial^2\psi/\partial p_i^2 = 0$. Since $x_i$ is an arbitrary variable, (13)
is correct for any $i = 1,\ldots,n$. Formula (17) gives the technique for solving task
(3) in the common case.



3. Common n-Dimensional Case 



Let us consider once more expression (2). Object ejwill be correctly classified if 

n 

f; = & vej, is true. Using disjunctive form the same expression can be written also 
for an anti-object as follows: 



fj = V &Ck 

J=lXj,<Xji 



(14) 



If it is supposed that fi,a; = lif (14) is true and zero if it is false, the maximum of 
\|/ = equal to minimum of (3). Any one of expressions (14) can be 

simplified using common variable extraction, $ab \vee ac = a(b \vee c)$, and group
elimination, $ab \vee a = a$; hence (14) can be represented as

$$f_i = a_i = \mathop{\&}_{e_j \in G_0} e_j \left( \bigvee_{k=1}^{m} \mathop{\&}_{e_j \in G_k} e_j \right), \qquad (15)$$

where $G_0$ is the common group, $m < n$ is the reduced number of groups and $G_k$ is a
partial group.

Let us find the derivatives of (15). Since the inclusion of any $e_i$ does not depend on
the inclusion or exclusion of any other $e_j$, $j \ne i$, in the optimal subset, it can be written:
1-fi = n (i“Pi)p>‘

From (16) it follows that for any $e_j \in G_0$
5f; da.: 



m ^ - 

Vk=i & 



(16) 



V GkCCej ^&ej£Gjj,l^j ' 



5ej ,_,Go 











(17) 


& Sj 


vr.i & e. 








GGQ,k 7 i:j 


ejeGt 








(15) in 


parenthesis 


can 


be 


written as 


where Fj = 


VG^ae 


j&e 


„ .^.Ciand 


The 


derivative 


of 


R is 


"2 - Fi V F 2 = -F 1 F 2 and finally 








Y 






(18) 


V & ®i & V ■ 







l^GfcSej e,eGk e,eGk J 



Since $f_i$, $a_i$ and their derivatives can be calculated using the same groups
$G_k$, computation of the gradient $\mathrm{grad}\,\psi = (g_1,\ldots,g_n)$ requires
almost no additional operations. This allows us to derive the following optimization
algorithm, based on gradient descent [5].

1. For every $f_i$, $i = 1,\ldots,m$, and $a_j$, $j = 1,\ldots,p$, obtain formulae (15)
and save them via the groups $G_k$, $k = 0, 1, \ldots$

2. Using random generation, choose an arbitrary vertex $Q = (q_1,\ldots,q_n)$, where
$q_i \in \{0,1\}$.

3. For the chosen vertex $Q$ calculate $\psi$ and $\mathrm{grad}\,\psi = (g_1,\ldots,g_n)$,
where each derivative $g_i$ is the sum of (17), (18) over all $f_i$ and $a_i$.

4. Define the next vertex $P = (p_1,\ldots,p_n)$, where $p_i = q_i$ if $q_i = 1$ and
$g_i > 0$, or $q_i = 0$ and $g_i < 0$. Otherwise $p_i = 1 - q_i$. If $P = Q$ (the same
vertex), the process is over and a local maximum point is reached; otherwise the next
step has to be done.

5. If $\psi(P) > \psi(Q) + \max_{i \in B} |g_i|$, where $B$ includes the indices $i$ for
which $p_i \ne q_i$, then $P$ is the next vertex. Otherwise the next vertex is
$Y = (q_1,\ldots,q_{j-1},1-q_j,q_{j+1},\ldots,q_n)$, where $j = \arg\max_{i \in B} |g_i|$.

6. Repeat Steps 2-5 for $N(\eta)$ random vertices, where $\eta$ is the required
probability of reaching a global maximum, typically $\eta = 0.95$. The value of
$N(\eta)$ can be found using the results of [15]: $N(\eta) \approx (1 + p/m)\ln(1 - \eta)$.

7. Select the maximal value $E^* = \arg\max \psi(E_j)$ among the $N(\eta)$ local maxima.

Following [15] the algorithm of local maximum search converges after no 
more than me^ steps 2-5, where y = 0.577215... is the Euler constant. The av- 



erage number of steps is C(m) = J exp ^ — 

0 Vi=l * 



px . 
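
The vertex search of Steps 2-7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `psi` is a hypothetical poly-line objective, the derivatives are obtained as differences of vertex values rather than from formulae (17) and (18), a zero derivative leaves the coordinate unchanged, and the number of restarts `n_starts` is picked by hand instead of being derived from [15].

```python
import random
from itertools import product

def psi(q):
    # Hypothetical poly-line objective used only for this illustration.
    return 3 * q[0] * q[1] - 2 * q[0] - q[1] + q[2]

def gradient(f, q):
    """For a poly-line f, d f / d q_i = f(q with q_i=1) - f(q with q_i=0)."""
    return [f(q[:i] + (1,) + q[i + 1:]) - f(q[:i] + (0,) + q[i + 1:])
            for i in range(len(q))]

def local_search(f, q):
    """Steps 3-5: hop between vertices until no coordinate should flip."""
    while True:
        g = gradient(f, q)
        # Step 4 (assumption: a zero derivative keeps the coordinate,
        # so every accepted move strictly increases f).
        p = tuple(1 - qi if (qi == 1 and gi < 0) or (qi == 0 and gi > 0) else qi
                  for qi, gi in zip(q, g))
        flips = [i for i in range(len(q)) if p[i] != q[i]]
        if not flips:
            return q                      # local maximum reached
        j = max(flips, key=lambda i: abs(g[i]))
        if f(p) > f(q) + abs(g[j]):       # Step 5: accept the full jump
            q = p
        else:                             # otherwise flip the best single bit
            q = q[:j] + (1 - q[j],) + q[j + 1:]

def maximize(f, n, n_starts, seed=0):
    """Steps 2, 6, 7: random restarts, keep the best local maximum."""
    rng = random.Random(seed)
    starts = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(n_starts)]
    return max((local_search(f, s) for s in starts), key=f)

best = maximize(psi, n=3, n_starts=5)
brute = max(psi(v) for v in product((0, 1), repeat=3))  # exhaustive check
```

For this small objective every local maximum happens to be global, so `psi(best)` agrees with the exhaustive value `brute`; in general several restarts are needed, which is what Step 6 accounts for.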



4. Experiments 

Let us consider, for example, the well-known task of automatic edge detection. The
purpose is to detect the edges of houses and ignore the edges of vegetation in aerial
photos. At the first stage of image processing the convolution map
$M(x,y) = \max_{\varphi,L} |I(x,y) * C(\varphi,L,x',y')|$ is obtained, where $C$ is
Canny's function [4] of edge scale $L$, rotated by the angle $\varphi$. $M(x,y)$ is
calculated using the fast Fourier transform. The threshold $M(x,y) > M_0$ is the first
parameter. The other parameters can be extracted by applying the mask
$C(\varphi,L) = \arg\max M(x,y)$ to the image $I(x,y)$ at the point $(x,y)$. The second
parameter is the minimal contrast
$C_0 = \max_{x,y \in C} I(x,y) - \min_{x,y \in C} I(x,y)$, and the third is the minimal
median value
$m_0 = \max\bigl(\mathrm{med}_{C(x,y)>0} I(x,y),\ \mathrm{med}_{C(x,y)<0} I(x,y)\bigr)$.
The next three parameters are the minimal common standard deviation
$\sigma_0 = D[I(x,y) \mid (x,y) \in C]$, the maximal one-side standard deviation
$\sigma_m = \min(D[I(x,y) \mid C(x,y) > 0],\ D[I(x,y) \mid C(x,y) < 0])$ and the relative
Student ratio
$t = |E[I(x,y) \mid C(x,y) > 0] - E[I(x,y) \mid C(x,y) < 0]| / \sigma_m$, where $E[x]$ and
$D[x]$ are the expectation and standard deviation of the variable $x$. Using this set of
six parameters, $n = 7$, $m = 142$ positive and $p = 180$ negative points, the edge image
of Fig. 1 was obtained.



5. Conclusion 

The further development of the maximization method for functions given on the vertices
of an n-dimensional unit hyper-cube [15] opens the way to decision making in a wide
range of well-known pattern recognition tasks such as prototype feature selection,
factor analysis, threshold estimation and so on. The only problem to be solved on this
path is to obtain correct expressions for the Boolean functions and their derivatives,
such as (15), (17), (18). Theoretically, following theorem (10), this is always possible,
but in practice the derivation of these expressions is a special and sometimes
complicated task that requires a deep understanding of the problem being considered.
Fig. 1. The result of house edge detection using optimal threshold selection
Abstract. Gibbs models with multiple pairwise pixel interactions permit
us to estimate characteristic interaction structures of spatially homogeneous
image textures. Interactions with partial energies over a particular
threshold form a basic structure that is sufficient to model a specific
group of stochastic textures. Another group, referred to here as regular
textures, permits us to reduce the basic structure in size, provided that
only a few primary interactions are responsible for this structure. If the
primary interactions can be considered statistically independent, a
sequential learning scheme reduces the basic structure and complements it
with a fine structure describing characteristic minor details of a texture.
Whereas the regular textures are described more precisely by the basic
and fine interaction structures, the sequential search may deteriorate the
basic interaction structure of the stochastic textures.

Keywords: image texture, Gibbs random field, interaction structure. 



1 Introduction 

Spatially homogeneous image textures are represented as samples of a parti- 
cular Gibbs random field by specifying a geometric structure and quantitative 
strengths, or Gibbs potentials, of multiple pairwise pixel interactions [21, 'fj . The 
interaction structure determines which pixels directly interact with a particular
pixel in the sense that they affect the conditional probabilities of grey levels in the
pixel. The interacting pixels are usually called the neighbours, and the interactions
are described by a neighbourhood graph Γ.

The spatially homogeneous interaction structure is represented by several
families of translation invariant pixel pairs, or second-order cliques of a
neighbourhood graph, each clique family having its own potential. Generally, the
potentials depend on grey level co-occurrences (GLC) in a pixel pair. The sum
of the potentials over a clique family is the partial interaction energy that
determines the contribution of the family to the probability of a particular image.

As shown in m, the analytical first approximation of the maximum like- 
lihood estimate (MLE) of the potential for a particular texture is proportional 
to the centered GLC histogram (GLCH) for the corresponding clique family in
a given training sample of the texture. Therefore the characteristic interaction
structure can be chosen by comparing the analytical estimates of the partial
energies for all clique families in a particular large search set.

Many spatially homogeneous textures belong to a specific class of stochastic 
textures that can be efficiently described by only a basic interaction structure 
consisting of interactions with the partial energies over a particular threshold. 
The threshold depends on the relative frequency distribution of the energies for 
the clique families in the search set m 

Textures with a regular visual pattern can be modelled only roughly by such a
basic structure because they also have a fine structure of pairwise pixel
interactions. The fine structure ranks below the basic structure in energy but describes
visually important repetitive minor details. Generally, the probability distribu- 
tions of the GLCs for the various clique families are statistically interdependent. 
But if the dependence between some families can be ignored, the clique fami- 
lies in the search set can be separated into the two groups: (i) the independent 
primary interactions with the top partial energies and (ii) the dependent secon- 
dary interactions with the lower energies obtained by a statistical interplay of 
the primary interactions. In this case the basic structure can be reduced in size 
and the fine structure can be recovered by an empirical sequential choice of the 
primary interactions that eliminates the secondary ones Pi]' 

This paper compares the empirical sequential scheme of learning the inter- 
action structure to the approximate analytical and the combined analytical- 
empirical sequential schemes. Textures that can be efficiently modelled by the 
sequentially chosen basic and fine structures form a specific group of regular 
textures differing from the stochastic textures. At the same time, the sequential
choice based on partial energies may result in worse interaction structures
for stochastic textures than the conventional thresholding of partial
energies.

2 Search for the Interaction Structure 

2.1 Basic Notation 

Let $g = [g_i : i \in R;\ g_i \in Q]$ be a digital greyscale image with a finite set of
grey levels $Q = \{0, 1, \ldots, q_{\max}\}$. Here, $R$ is a finite arithmetic lattice
supporting the images. A spatially homogeneous structure $C = [C_\alpha : \alpha \in A]$
of pairwise interactions between the pixels $i \in R$ is specified by a particular subset
of the clique families $C_\alpha = \{(i,j) : (i,j) \in R^2;\ i - j = \mathrm{const}_\alpha\}$.
Every family consists of the translation invariant cliques $(i,j)$ with the fixed
inter-pixel shift $i - j = \mathrm{const}_\alpha = (\Delta x_\alpha, \Delta y_\alpha)$.

A partial interaction energy $E_\alpha(g)$ of a clique family in an image $g$ is

$$E_\alpha(g) = \sum_{(i,j) \in C_\alpha} V_\alpha(g_i, g_j) = V_\alpha \bullet H_\alpha(g) \qquad (1)$$

where $V_\alpha = [V_\alpha(q,s) : (q,s) \in Q^2]$ is a Gibbs potential for the clique
family $C_\alpha$ with values depending on the GLCs $(q,s)$, $H_\alpha(g)$ is the GLC
histogram (GLCH) collected in the image $g$ over the family $C_\alpha$, and $\bullet$
denotes the dot product.
Let $V = [V_\alpha : \alpha \in A]$ and $H(g) = [H_\alpha(g) : \alpha \in A]$ denote the
potential vector and the GLCH vector, respectively. The GLC-based Gibbs image model
with multiple pairwise pixel interactions

$$\Pr(g \mid C, V) = \frac{1}{Z}\exp(E(g)) \qquad (2)$$

relates the probability of every sample $g$ to its total interaction energy $E(g)$:

$$E(g) = \sum_{\alpha \in A} E_\alpha(g) = V \bullet H(g). \qquad (3)$$

oGA 

As shown in the analytical first approximation of the MLE of the potential 
$V_\alpha$ is proportional to the difference between the GLCH $H_\alpha(g^\circ)$ and the
expected uniform GLCH $H_{\alpha,\mathrm{irf}}$ for the samples of the independent random
field (IRF), or, what is the same, to the centered GLCH. The IRF is described by the
Gibbs model of Eq. (2) with zero-valued potentials $V_\alpha = 0$.

Therefore, the characteristic interaction structure can be recovered by comparing
the analytical estimates of the relative partial interaction energies

$$e_\alpha(g^\circ) = H_\alpha(g^\circ) \bullet (H_\alpha(g^\circ) - H_{\alpha,\mathrm{irf}}) \qquad (4)$$

for a large search set W of possible clique families $C_\alpha$. In all the experiments
below, the search set W contains 3280 clique families with the inter-pixel shifts
in the range $-40 \le \Delta x_\alpha, \Delta y_\alpha \le 40$.
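
As an illustration of Eq. (4), the following Python sketch collects a GLCH for one inter-pixel shift and evaluates the relative partial energy. It assumes that the histograms are normalised to relative frequencies and that the IRF histogram is uniform with value 1/|Q|^2 in every bin; the function names and the toy images are hypothetical. The helper `search_set` enumerates one shift per clique family, which reproduces the count of 3280 families for shifts up to 40.

```python
from collections import Counter

def glch(img, dx, dy):
    """Normalised GLC histogram over cliques (i, j) with i - j = (dx, dy)."""
    h, w = len(img), len(img[0])
    counts = Counter()
    for y in range(max(0, dy), min(h, h + dy)):
        for x in range(max(0, dx), min(w, w + dx)):
            counts[(img[y][x], img[y - dy][x - dx])] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def relative_energy(img, dx, dy, levels):
    """Eq. (4) with the uniform IRF histogram 1 / levels**2 in every bin."""
    uniform = 1.0 / (levels * levels)
    # Empty bins contribute zero, so only observed pairs are summed.
    return sum(f * (f - uniform) for f in glch(img, dx, dy).values())

def search_set(max_shift):
    """One shift per clique family: (dx, dy) and (-dx, -dy) coincide."""
    return [(dx, dy)
            for dy in range(0, max_shift + 1)
            for dx in range(-max_shift, max_shift + 1)
            if dy > 0 or dx > 0]

flat = [[0] * 8 for _ in range(8)]                        # constant image
stripes = [[x % 2 for x in range(8)] for _ in range(8)]   # vertical stripes
e_flat = relative_energy(flat, 1, 0, levels=2)
e_stripes = relative_energy(stripes, 0, 1, levels=2)
n_families = len(search_set(40))
```

Ranking all families in `search_set(40)` by `relative_energy` for a training image gives the energy distribution from which a characteristic structure is selected.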

2.2 Basic Structure via Thresholding the Partial Energies 

In the simplest case, the basic interaction structure can be learnt by comparing 
relative partial energies $e_\alpha(g^\circ)$ to a threshold that depends on the frequency
distribution of all the energies for the search set W. Such a structure is sufficient 
to simulate many natural image textures called stochastic textures in m- 




Fig. 1. Training and simulated samples D29 (a, b) and D101 (c, d) with the basic
structures learnt by thresholding the analytically estimated partial energies.



Figure 1 shows, for example, the training and simulated 128 × 128 samples of
the textures D29 “Beach sand” and D101 “Cane” Q. The basic interaction
structures (11 and 39 clique families, respectively) were learnt by using the threshold




750 



G. Gimel’farb 



$\theta = E_m + 4\sigma$, where $E_m$ and $\sigma$ are the mean energy and the standard
deviation, respectively. Such a basic structure results in a visually good simulation of
the stochastic texture D29. But the minor repetitive details of the more regular
texture D101 are not described at all, and the simulation gives only a very rough
approximation of the original visual pattern.
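
The thresholding rule can be sketched as follows. This is a minimal Python illustration, assuming the threshold is the mean energy plus four standard deviations taken over the whole search set; the use of the population standard deviation and the toy energies are assumptions made only for the example.

```python
from statistics import mean, pstdev

def basic_structure(energies, k=4.0):
    """Return the clique families whose energy exceeds mean + k * std."""
    values = list(energies.values())
    theta = mean(values) + k * pstdev(values)
    return sorted(a for a, e in energies.items() if e > theta)

# Hypothetical energies: 30 weak families and one strong family.
energies = {(dx, 1): 0.0 for dx in range(30)}
energies[(0, 2)] = 10.0
selected = basic_structure(energies)
```

With these values the threshold is roughly 7.4, so only the strong family `(0, 2)` is selected.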

2.3 Empirical Sequential Learning 

The above thresholding may produce basic structures of larger size than one
needs for describing the characteristic visual features of a texture if some GLC
distributions over clique families with the top partial energies can be considered
statistically independent. Below, such families will be referred to as the
primary ones. Let $C_\alpha$ and $C_\beta$ be primary clique families with the sizable
energies $E_\alpha(g^\circ) > \theta$ and $E_\beta(g^\circ) > \theta$, respectively. Then
they can give rise to a lower but still sizable energy $E_\gamma(g^\circ)$ for the
secondary family $C_\gamma$ such that
$\mathrm{const}_\gamma = \mathrm{const}_\alpha + \mathrm{const}_\beta$, although the latter
family may not take part in the basic structure. It is evident that straightforward
thresholding cannot detect a fine interaction structure describing minor but visually
characteristic regular details of a texture if their interaction energies are lower than
the energies of the secondary interactions produced by the primary basic ones.

Empirical sequential learning, proposed first by Zalesny m, reduces the 
basic structure to only the primary interactions and recovers the fine structure 
by repeating iteratively the texture simulation and structure selection steps. We 
shall restrict our consideration to the specific type of sequential learning that is 
based on the relative partial energies of interactions. At each iteration $t$, a new
image sample $g^{[t]}$ is simulated under the current interaction structure $C^{[t]}$.
Then the GLCHs $H_\alpha(g^\circ)$ for a given training sample $g^\circ$ are compared to
the GLCHs $H_\alpha(g^{[t]})$ for the simulated sample, and the clique family with the
maximum relative partial energy

$$e_\alpha(g^\circ) = H_\alpha(g^\circ) \bullet \bigl(H_\alpha(g^\circ) - H_\alpha(g^{[t]})\bigr) \qquad (5)$$

is selected to be added to the current structure.

In principle, all the statistical interplay between the primary and secondary
energies is taken into account by simulation, so that both the basic and fine
structures of the minimum size are expected to be found. But it should be noted
that image simulation with a fixed interaction structure and potentials results in
a set of different samples such that their GLCHs approach the GLCHs for a given
training sample only on average. Therefore the obtained basic and especially fine
structures will also reflect a particular sequence of simulated images, and the
same training sample may produce notably different interaction structures.
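
The selection loop of the empirical sequential scheme can be sketched as below. This is a simplified Python illustration: the texture simulator is replaced by a hypothetical callback `simulate`, the toy stand-in `toy_simulate` reproduces the chosen families exactly and treats all others as the uniform IRF, and potentials are not modelled at all.

```python
def relative_energy(h_train, h_sim):
    """Eq. (5): H(g0) . (H(g0) - H(g_t)) for one clique family."""
    return sum(f * (f - h_sim.get(pair, 0.0)) for pair, f in h_train.items())

def sequential_structure(train, simulate, n_families):
    """Greedy choice of one clique family per iteration.

    train:    {family: GLCH of the training sample}
    simulate: hypothetical callback, structure -> {family: GLCH of a
              sample simulated under that structure}
    """
    structure = []
    for _ in range(n_families):
        sim = simulate(structure)
        candidates = [a for a in train if a not in structure]
        best = max(candidates, key=lambda a: relative_energy(train[a], sim[a]))
        structure.append(best)
    return structure

# Toy data with two grey levels: one strongly structured family,
# one moderately structured family and one purely random family.
uniform = {(q, s): 0.25 for q in (0, 1) for s in (0, 1)}
train = {
    (1, 0): {(0, 0): 1.0},
    (0, 1): {(0, 0): 0.5, (1, 1): 0.5},
    (1, 1): dict(uniform),
}

def toy_simulate(structure):
    # Families already in the structure are matched exactly; the rest
    # behave like the independent random field.
    return {a: (train[a] if a in structure else uniform) for a in train}

chosen = sequential_structure(train, toy_simulate, n_families=2)
```

In a real experiment `simulate` would run a stochastic texture sampler, so repeated runs may return different structures, as noted above.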

Figures 2-4 show the textures simulated after learning the interaction structure
by the empirical sequential search using the training samples D29 and D101
in Figure 1. The primary structure found for the texture D29 contains only the
two clique families with the inter-pixel shifts [0, 1] and [1, 0]. The additional fine
structure includes the families with very low energies. As a result, they are chosen
rather arbitrarily, and the resulting structure is unsuitable for simulating the
Fig. 2. Texture D29 simulated with 4 (a), 5 (b), 7 (c), and 11 (d) clique families found
by the sequential empirical choice of the family with the top relative partial energy.
Notice that these images do not mimic the initial visual pattern of Figure 1a.




Fig. 3. Texture D101 simulated with 7 (a) - 14 (h) clique families found by the
sequential empirical choice of the family with the top relative partial energy.



texture samples that possess visual similarity to the training sample D29 in
Figure 1a. The images in Figure 2, in contrast to the simulated sample in
Figure 1b, differ much from the training sample even when the overall interaction
structure is of the same or greater size than the basic structure recovered by
thresholding the energies.

But, as shown in Figures 3 and 4, the overall structure found sequentially
for modelling the texture D101 with a repetitive visual pattern both contains
fewer clique families and represents the fine details better than the basic structure
found by thresholding the energies. The sequential choice of a single top clique
family with the highest relative energy of Eq. o proposed in m forms the 
basic structure of about 16 clique families and the fine structure of 4-6 clique 
families (compare Figure 1d to Figures 3 and 4).

Similar results in Figure 5 are obtained twice as fast by choosing the two top
families at each step. But the structures of similar size obtained by choosing more
than two clique families per iteration give somewhat worse simulation results (see
Fig. 4. Texture D101 simulated with 15 (a) - 22 (h) clique families found by the
sequential empirical choice of the family with the top relative partial energy.






Fig. 5. Texture D101 simulated with 4 (a) - 12 (e), 16 (f), 18 (g), and 22 (h) clique
families found by the sequential empirical choice of the top two families.



Figure 6). Also, it should be noted that the visual quality of simulation does not
steadily increase with the structure size. As follows from Figures 00, the quality 
may even degrade after adding a clique family and then be restored after adding 
one-two more families. 



2.4 Analytical Sequential Learning 

Assuming the probability distributions of the GLCs for the primary clique fami- 
lies are statistically independent, the secondary interactions with relatively high 
Fig. 6. Texture D101 simulated with 4 (a) - 32 (h) clique families found by the
sequential empirical choice of the top four families.



energies can be approximately taken into account by recomputing each secondary
GLCH, $H_\gamma(g)$, using each currently chosen primary GLCH, $H_\alpha(g)$, and
the corresponding previous GLCH $H_\beta(g)$ such that
$\mathrm{const}_\gamma = \mathrm{const}_\alpha + \mathrm{const}_\beta$.
In this case all the GLCHs in the search set W can be analytically updated
after adding to a current interaction structure the next clique family with the 
maximum relative energy of Eq. with respect to the training sample. Such an 
analytical estimation does not take account of all the statistical interplay of the 
families but only approximates the actual distribution of the relative energies. 

Figures 7a-b and 8a-b demonstrate the grey-coded actual and analytically
computed distributions of the partial energies of Eq. 0 for the textures D29 and 
D101 over the search set W. Here, each square box of size 4 × 4 pixels represents a
particular inter-pixel shift $\mathrm{const}_\alpha = (\Delta x, \Delta y)$;
$-40 \le \Delta x, \Delta y \le 40$. The energy distributions for the textures D29 and
D101 are computed, respectively, with the 10 and 15 sequentially chosen primary clique
families. The corresponding interaction structures as well as the more detailed
structures with the 22 clique families are shown in Figures 7c-d and 8c-d. The
low-energy fine structure of the texture D29 does not represent specific visual
features and is obviously arbitrary, as distinct from the regular fine structure of the
texture D101.

The sequential analytical scheme results in a sufficiently accurate approximation
of the actual interaction energies. Therefore it can be used for reducing
the size of the basic structure with respect to the similar structures obtained by
thresholding the energies of Eq. (4). The samples D101 in Figure 9a-c simulated
with the analytically found interaction structures containing 15-25 clique
families are very similar to the sample D101 in Figure 1d simulated with the
39 families. The texture in Figure 9d simulated with the 32 clique families also
reflects some fine visual details, but to a lesser extent than the samples in
Figures E[e-h, 0d-h, andElh, obtained by the empirical sequential learning. 
Fig. 7. Estimated with 10 clique families (a) and actual (b) energies for the D29 model
and the interaction structures with 10 (c) and 22 (d) analytically chosen families.




Fig. 8. Estimated with 15 clique families (a) and actual (b) energies for the D101 model
and the interaction structures with 15 (c) and 22 (d) analytically chosen families.




Fig. 9. Texture D101 simulated with 15 (a), 19 (b), 25 (c), and 32 (d) analytically
chosen clique families.



As for the stochastic texture D29, the analytical sequential search for
the top relative partial energies has the same drawbacks as the empirical one.
Figures OJa-b, 0 and E3 show that the basic structure recovered by threshol- 
ding produces much better simulation results even when the analytically chosen 
structure is of larger size. 

The possible reason is that the assumed statistical independence of the GLC 
distributions for the primary interactions does not hold for this texture so that 
the exclusion of the secondary interactions by using the relative energies of 
Eq. (0 is not justified. In such a case the search for the reduced basic structure 
and an additional fine structure can only deteriorate the actual basic structure.
Fig. 10. Texture D29 simulated with 4 (a), 10 (b), 15 (c), and 22 (d) analytically
chosen clique families.



2.5 Combined Sequential Learning 

As follows from the above experiments, sequential learning can produce
efficient interaction structures only if our assumption about the independent
primary GLC distributions fits the textures under consideration reasonably well.
Textures for which this assumption holds will be called regular textures.

Empirical sequential learning outperforms the faster analytical scheme
with respect to the fine interaction structure of a regular texture. But the reduced
basic structures recovered empirically or analytically are very similar, so that
sequential learning can be accelerated by combining both approaches.

Figure 11 shows the results of simulating the texture D101 when the reduced
basic interaction structure with 15 clique families is first found analytically (see 
Figures 0 and 0) and then is appended with the fine structure of 1-8 clique 
families by the empirical learning. It is evident that the purely empirical and 
the combined analytical-empirical sequential learning produce very similar final 
results, but the latter approach is much faster than the former one. 

3 Conclusions 

These and other our experiments (as well as experiments 0 in empirical sequen- 
tial learning based on the chi-square distances between the GLCHs) suggest that 
modelling of spatially homogeneous textures with the Gibbs model of Eq. (2)
must take account of possible statistical dependences between the clique families 
that form the characteristic interaction structure. Stochastic textures introduced 
in m have basic structures of only weakly interdependent primary interactions
so that no interaction with a sizeable partial energy can be considered as the 
secondary one and excluded from the basic structure. 

Regular textures differ from the stochastic ones in that they can be modelled 
by the reduced basic and the additional fine structures. The initial basic struc- 
ture contains both the strongly and weakly interdependent interactions with a 
sizable energy. Assuming that only the top-energetic interactions are the inde- 
pendent primary ones, the basic structure is reduced in size by the empirical or 
analytical sequential exclusion of the dependent secondary interactions. Then the 
fine structure is recovered in a similar way by the empirical sequential learning.
Fig. 11. Texture D101 simulated with the interaction structure containing 15
analytically found clique families and the additional 1 (a) - 8 (h) families found by the
sequential empirical choice of the top family.



Sequential learning extends the range of image textures that can be
modelled by multiple pairwise pixel interactions, but it does not replace the simple
energy thresholding for the stochastic textures. Also, the sequential learning
schemes, as well as the parallel thresholding of partial energies, still have no
theoretically justified rules for choosing an adequate size of the interaction structure.
Thus the number of clique families to be used in the Gibbs model of a particular
texture is selected mainly on an experimental basis. Our experiments and the
experiments in [4, 6] show that many natural spatially homogeneous images are of
the stochastic or regular type. But a vast majority of images are outside these
types and should be modelled by other means.
Abstract. One crucial issue in automatic document analysis is the
discrimination between text and graphics/images. This paper presents a novel, 
robust method for the segmentation of text and graphics/images in digitized 
documents. This method is based on the representation of window-like portions 
of a document by means of their gray level histograms. Through empirical 
evidence it is shown that text and graphics/images regions have different gray 
level histograms. Unlike the usual characterization of histograms, which is
based on statistical parameters, a novel approach is introduced.
This approach works with the histogram Fourier transform since it possesses all 
the information contained in the histogram pattern. The next and logical step is 
to automatically select the most discriminant spectral components as far as the 
text and graphics/images segmentation goal is concerned. A fully automated 
procedure for the optimal selection of the discriminant features is also 
expounded. Finally, empirical results obtained for the text and graphics/images 
segmentation using a simple three-layer perceptron-like neural network are also 
discussed. 

Keywords: Feature extraction and selection; Image analysis; Applications: 
automatic document analysis. 



1. Introduction - The Gray Level Histogram as a Discriminant 
Tool for Text and Graphics/Images Segmentation 

Document image analysis is an active research and development field [1] in which 
pattern recognition techniques are of the greatest interest. One critical issue in the 
automatic analysis of digitized documents is the separation of text and 
graphics/images. The text regions of the document are usually analyzed by means of 
well-known OCR techniques, whereas the graphics and images are just codified in 
order to obtain optimal storage and retrieval of such information. This communication 
describes a novel method for the segmentation of text and graphics/images. This 
method exploits the empirical evidence that regions of text and regions of
graphics/images have very different gray level histograms. As an illustration, Figure 1
shows two examples. Notice the application of a window on the original digitized
document in order to compute the brightness histogram in small portions of the whole
document. The practical issues concerning the window size and the scanning process
over the complete document, although important, are not specifically considered in
this paper.







Fig. 1. Examples of text and images regions in digitized documents 

In figure 2 the corresponding histograms are displayed. Notice the similarity 
between the text regions histograms and their differences with the image regions 
histograms. Therefore it seems reasonable to exploit the gray level histogram patterns 
for the discrimination between text and graphics/images. 
Fig. 2. Gray level histograms for the window indicated in figure 1.



After exhaustive empirical work we have confirmed the similarity of the gray
level histograms of text regions, in which there appears a large accumulation of gray
levels on the right side, corresponding to the background pixels, and a small
distribution over the low gray levels due to the characters' pixels. Almost all these
histograms are purely bimodal and close to normal distributions. By contrast, the
graphics/images regions do not present such a uniform pattern, which is logical given
the variety of the graphics and, in particular, of the images. Nevertheless, as far as the
segmentation of text and graphics/images is concerned, the important issue is the
uniformity of text histograms.
2. Histogram Fourier Transform

Once the idea of using the histogram pattern as the basis for the segmentation of text 
and graphics/images is accepted, the next step is to obtain an appropriate set of 
discriminant features from the histogram. The usual approach is to compute
first-order statistical parameters: mean, variance, skewness, kurtosis, entropy, etc.
Unfortunately, in many cases these discriminant features are not satisfactory as they 
do not convey all the discriminant information of the whole histogram and therefore 
some authors prefer to work with the complete gray level distributions [5]-[8]. 

In this paper we have taken an intermediate approach by using the magnitude 
spectrum of the histogram as the initial set of discriminant features. 

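Let $h(i)$ be the gray level histogram of some generic region (the scanning window);
then its Fourier transform is:

$$H(u) = \sum_{i=0}^{N-1} h(i)\, e^{-j\frac{2\pi}{N} i u}; \qquad 0 \le i \le N-1;\ 0 \le u \le N-1 \qquad (1)$$

The amplitude spectrum of $H(u)$ is the magnitude of $H(u)$:

$$A(u) = \sqrt{H_{\mathrm{Re}}^2(u) + H_{\mathrm{Im}}^2(u)}; \qquad 0 \le u \le N-1 \qquad (2)$$

and its phase is:

$$\Psi(u) = \arctan\frac{\mathrm{Im}(u)}{\mathrm{Re}(u)}; \qquad 0 \le u \le N-1 \qquad (3)$$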



For the characterization of the histogram pattern it is sufficient to concentrate on 
the magnitude of the Fourier transform and ignore the phase. Furthermore, the 
amplitude spectrum is invariant under gray level histogram translations: 

If h{i) = g{i-t),yi^[f),N -\] ^ A{H(u)} = A{G(u)},yue[0,N-l] (4) 
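The amplitude spectrum and its translation invariance can be illustrated with a minimal NumPy sketch. The window contents, the function name `amplitude_spectrum` and the shift of 40 gray levels are illustrative assumptions, not values from the experiments. For the discrete transform the invariance is exact when the histogram shift is circular; here the shifted gray levels stay inside the 0..255 range, so no wrap-around occurs.

```python
import numpy as np

def amplitude_spectrum(window, levels=256):
    """Return A(u), the magnitude of the DFT of the gray level histogram."""
    # h(i): number of pixels of the window with gray level i
    hist = np.bincount(window.ravel(), minlength=levels).astype(float)
    # np.fft.fft computes the unnormalized DFT used in expression (1)
    return np.abs(np.fft.fft(hist))

rng = np.random.default_rng(0)
# Synthetic 35 x 80 window with gray levels in 100..199 (assumption)
window = rng.integers(100, 200, size=(35, 80), dtype=np.uint8)
# Same window with every gray level translated by t = 40
shifted = (window.astype(int) + 40).astype(np.uint8)

a_original = amplitude_spectrum(window)
a_shifted = amplitude_spectrum(shifted)
print(np.allclose(a_original, a_shifted))
```

The zero-frequency component A(0) equals the number of pixels in the window, so it carries no discriminant information between regions of equal size.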



3. Automatic Selection of the Optimal Discriminant Features 

Although invariant to translations, the amplitude spectrum has as many 
components as the original histogram, so that a drastic reduction in its dimension is 
crucial. Such a reduction aims at obtaining the smallest possible set of the 
most discriminant features. The steps below describe an automatic procedure for 
the selection of the best discriminant features of the amplitude spectrum. 




760 M.A. Patricio and D. Maravall 



Step 1. Data Normalization 



First, a zero mean and unit variance data normalization is applied to all of the 
magnitude spectrum components. 

For a generic class or pattern, α_k, the j-th feature average, m_j, and the j-th 
variance, σ_j², are computed as the sample mean and the sample variance: 

$$m_j = \frac{1}{N_k}\sum_{i=1}^{N_k} x_{ij}, \qquad \sigma_j^2 = \frac{1}{N_k}\sum_{i=1}^{N_k} (x_{ij} - m_j)^2 \qquad (5)$$

where N_k is the number of design samples available from class α_k. 
Then, the generic j-th feature component, x_j, is normalized as follows: 

$$x_j^* = \frac{x_j - m_j}{\sigma_j} \qquad (6)$$

For simplicity of notation we drop the asterisks in the 
next section, although all the data have been previously normalized. 
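A minimal sketch of the zero mean, unit variance normalization follows. It assumes the samples are stored one per row and that the statistics are computed over whatever sample set is passed in; the array `spectra` and the guard for constant features are illustrative additions.

```python
import numpy as np

def normalize_features(samples):
    """Zero-mean, unit-variance scaling of each column (feature)."""
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    # Guard (assumption): a constant feature is mapped to zero, not divided by 0
    std = np.where(std > 0, std, 1.0)
    return (samples - mean) / std, mean, std

# Hypothetical data: 3 samples, 2 spectrum components
spectra = np.array([[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]])
normalized, mean, std = normalize_features(spectra)
print(normalized)
```

The returned `mean` and `std` would have to be reused unchanged on any window classified later, so that design and test data share the same scale.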



Step 2. Discriminant Feature Ordering 



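For each individual discriminant variable, X_i, the generalized Fisher ratio is computed: 

$$R_i = \frac{(m_{i1}-m_i)^2 + (m_{i2}-m_i)^2 + \dots + (m_{iN}-m_i)^2}{\sigma_{i1}^2 + \sigma_{i2}^2 + \dots + \sigma_{iN}^2}, \qquad m_i = \frac{m_{i1} + m_{i2} + \dots + m_{iN}}{N} \qquad (7)$$

where m_ij and σ_ij² are the mean and the variance of feature i in class j, and the 
index j=1, 2, ..., N represents the corresponding pattern or class. For the 
text and graphics/image segmentation case, obviously j=1, 2. The index i=1, 2, ..., n 
stands for the individual features and for digitized documents with 256 brightness 
values i=1, 2, ..., 256. 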
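The per-feature ratio and the resulting ranking of the features by decreasing ratio can be sketched as follows. This is an illustration under assumptions: the numerator is taken as the sum of squared deviations of the class means from their average, the function names and the toy data are hypothetical, and a small floor on the denominator avoids division by zero.

```python
import numpy as np

def fisher_ratios(samples, labels):
    """Generalized Fisher ratio of every feature (column of `samples`)."""
    classes = np.unique(labels)
    means = np.array([samples[labels == c].mean(axis=0) for c in classes])
    variances = np.array([samples[labels == c].var(axis=0) for c in classes])
    grand_mean = means.mean(axis=0)              # m_i: average of class means
    between = ((means - grand_mean) ** 2).sum(axis=0)
    within = np.maximum(variances.sum(axis=0), 1e-12)
    return between / within

def order_features(samples, labels):
    """Feature indices sorted by decreasing Fisher ratio, plus the ratios."""
    ratios = fisher_ratios(samples, labels)
    return np.argsort(-ratios), ratios

# Toy data: feature 0 separates the two classes, feature 1 does not
x = np.array([[0.0, 1.0], [0.2, 3.0], [5.0, 1.1], [5.2, 2.9]])
y = np.array([0, 0, 1, 1])
order, ratios = order_features(x, y)
print(order, ratios)
```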

After computing the Fisher ratios the features are arranged as follows: 

$$X_1, X_2, \dots, X_n \;/\; R_1 \ge R_2 \ge \dots \ge R_n \qquad (8)$$

Step 3. Selection of the Feature Vector 

After completing step 2, the first action is to select the optimal individual feature. Let 
X_1 be the most discriminant feature, i.e. R_1 is the highest Fisher ratio in expression 
(8). Once the optimal feature X_1 has been chosen, the next action is to form all 
possible two-dimensional vectors that include X_1 and to select the 
pair with the highest discriminant ratio. This process is repeated with three, four, five, 
etc. components of the feature set until the performance improvement falls below a 
given threshold. 

For the evaluation of the discriminant capacity of multidimensional feature vectors 
several indices or ratios can be used: 

$$J_1 = \frac{\mathrm{trace}(S_B)}{\mathrm{trace}(S_W)}, \qquad J_2 = \frac{\det|S_B|}{\det|S_W|}, \qquad J_3 = \mathrm{trace}(S_W^{-1} S_B) \qquad (9)$$

S_B is the between scatter matrix that provides a quantitative indication of the 
dispersion among the classes: 

$$S_B = \sum_{i=1}^{N} p_i\, (m_i - m)(m_i - m)^T \qquad (10)$$

where p_i is the a priori probability of class α_i; m_i is the vector mean of the class α_i 
and m is the global vector mean: 

$$m = \sum_{i=1}^{N} p_i\, m_i$$

The within matrix, S_W, is the average covariance matrix: 

$$S_W = \sum_{i=1}^{N} p_i\, C_i, \qquad C_i = E\{(x - m_i)(x - m_i)^T\} \qquad (11)$$



In figure 3 the flowchart of the automatic selection algorithm is displayed. 
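The sequential selection of Step 3 can be sketched as a greedy forward search. This is an illustration under stated assumptions rather than the authors' implementation: the index trace(S_W^{-1} S_B) is used as the discriminant index, class priors are estimated from the sample counts, a pseudo-inverse guards against a singular within matrix, and the 1% relative stopping threshold mirrors the one reported in the results. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def scatter_matrices(samples, labels):
    """Between (S_B) and within (S_W) scatter matrices of labelled samples."""
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / counts.sum()               # estimated a priori probabilities
    means = np.array([samples[labels == c].mean(axis=0) for c in classes])
    global_mean = priors @ means
    dim = samples.shape[1]
    s_b, s_w = np.zeros((dim, dim)), np.zeros((dim, dim))
    for p, c, m in zip(priors, classes, means):
        d = (m - global_mean)[:, None]
        s_b += p * (d @ d.T)
        centered = samples[labels == c] - m
        s_w += p * (centered.T @ centered) / len(centered)
    return s_b, s_w

def j3(samples, labels):
    """Discriminant index trace(S_W^-1 S_B)."""
    s_b, s_w = scatter_matrices(samples, labels)
    return np.trace(np.linalg.pinv(s_w) @ s_b)

def forward_selection(samples, labels, first, min_gain=0.01):
    """Grow the feature set from `first` until the relative gain < min_gain."""
    selected = [first]
    best = j3(samples[:, selected], labels)
    remaining = set(range(samples.shape[1])) - {first}
    while remaining:
        score, feature = max(
            (j3(samples[:, selected + [f]], labels), f) for f in remaining)
        if score - best < min_gain * best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best

# Synthetic two-class data: features 0 and 2 are informative, 1 and 3 are noise
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))
classes = np.repeat([0, 1], 100)
data[classes == 1, 0] += 4.0
data[classes == 1, 2] += 2.0
chosen, index_value = forward_selection(data, classes, first=0)
print(chosen, index_value)
```

In practice `first` would be the feature with the highest Fisher ratio from Step 2.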

Fig. 3. Flowchart of the automatic feature selection algorithm. 

4. Results 

Many different types of documents have been digitized with a commercial 300 dpi 
scanner in order to test the segmentation method. We have taken care to test a 
large variety of fonts and types of documents: books, magazines, newspapers, etc. 
After empirical evaluation the final window size is 80x35 pixels. We have used 585 
labeled samples for the automatic selection of the optimal set of discriminant features. 
After applying the method previously described, the following discriminant features 
were obtained at each sequential level: 

Level 1 → (x_161) 

Level 2 → (x_161, x_47) 

Level 3 → (x_161, x_47, x_209) 

The algorithm ended at level 3 because the next level produced less than 1% 
improvement on the discriminant index. 



Fig. 4. Dispersion diagram of the text and graphics/image patterns versus the three most 
discriminant features (legend: ▼ graphics/images, + text). 

In figure 4 one can observe the excellent discriminant quality of the selected 
spectrum coefficients x_161, x_47 and x_209, which makes the design of the automatic 
classifier straightforward. Consequently, we have chosen a very simple three-layer 
perceptron-like neural network with a single hidden layer formed by two neurons, 
which is displayed in figure 5. 

Fig. 5. Multilayer perceptron for text and graphics/images segmentation. The input layer 
receives the selected features; the output layer has one unit per class: Class 1 (text) 
and Class 2 (graphics/images). 

This neural network has been trained with the well-known backpropagation rule with 
a learning rate of 0.5 and a maximum permitted deviation of 0.1. The neural network 
converges after very few iterations. 

For testing we have employed 585 samples (297 for text regions and 288 for 
graphics/images regions) taken from a variety of documents: newspapers, magazines, 
books, etc. The results are summarized in the confusion matrix of table 1, where Class 
1 corresponds to text regions and Class 2 to graphics/images regions. The success 
ratio is higher than 99 per cent for both classes. 

Table 1. Confusion matrix for the 585 test samples. 

                 Segmentation result 
True class          1          2 
    1             295          2 
    2               1        287 
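The success ratios quoted for each class follow directly from the rows of the confusion matrix, as this short check shows (rows are true classes, columns are segmentation results; the variable names are illustrative).

```python
import numpy as np

# Rows: true class (1 = text, 2 = graphics/images); columns: segmentation result
confusion = np.array([[295, 2],
                      [1, 287]])

per_class = confusion.diagonal() / confusion.sum(axis=1)   # success ratio per class
overall = confusion.trace() / confusion.sum()              # overall success ratio
print(np.round(100 * per_class, 2), round(100 * overall, 2))
```

This gives about 99.33% for text, 99.65% for graphics/images and 99.49% overall.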





In figure 6 two examples of segmented documents are shown. The segmentation 
results appear on the right, where the windows stand for the image regions classified 
as text. 

Fig. 6. Segmentation results. On the left are displayed the digitized documents. The automatic 
recognizer explores the document from top to bottom and from left to right with an 85x35 
window. The segmentation results appear on the right, where the rectangles correspond to the 
text regions. 



5. Conclusions 

A novel, robust method for the segmentation of text and graphics/images in digitized 
documents has been described. This method is independent of fonts and type of 
document: books, magazines, newspapers, etc. 

The basic idea is to use the magnitude spectrum of the gray level histogram of an 
image window as the initial set of discriminant features. Afterwards, a procedure for 
the automatic selection of the best discriminant features is applied. Due to the 
outstanding discriminant quality and the small number of selected features, the design 
of the classifier is straightforward. A simple three-layer perceptron neural network, 
trained with the backpropagation rule, is applied for the segmentation of text and 
graphics/images, giving excellent results with at least 99% success. 

Abstract. In this work, fast approximate nearest neighbours search 
algorithms are shown to provide high accuracies, similar to those of exact 
nearest neighbour search, at a fraction of the computational cost in an 
OCR task. Recent studies [Ztill .i) have shown the power of k-nearest 
neighbour classifiers (k-nn) using large databases, for character recognition. 
In those works, the error rate is found to decrease consistently as the size 
of the database increases. Unfortunately, a large database implies large 
search times if an exhaustive search algorithm is used. This is often cited 
as a major problem that limits the practical value of the k-nearest 
neighbours classification method. The error rates and search times presented 
in this paper show, however, that k-nn can be a practical technique for 
a character recognition task. 

Keywords: Handwriting Recognition, OCR, Fast Nearest Neighbour 
Search, Approximate Search, k-NN. 



1 Introduction 

Statistical non-parametric methods, such as k-nearest neighbours classifiers, have 
received renewed attention in recent years, since very good results are being 
reported on many pattern recognition tasks (e.g. |2H|,0). Their theoretical 
properties, already known for at least three decades, are also being revisited and 
restated under milder assumptions IZD, 0. 

One of the basic requirements for these methods to perform well, however, 
is access to a very large database of labeled prototypes. In 
some tasks, like handwritten character recognition, collecting a large number 
of examples is not as hard as in other applications, but searching through the 
whole database to find the nearest objects to a test image is time-consuming, 
and has to be done for every character in a document. This has been a recurring 
argument against the use of k-nn for this task, since a character recognizer is 
supposed to carry out many classifications per second on a moderately 
powerful machine to be useful and competitive. Additionally, the whole database 
has to be stored in main memory to perform the search efficiently, since access 
to secondary storage would penalize the search time even further. 

The fast increase in the memory capacity of recent computers alleviates the 
space occupation problem, making possible the storage of up to tens of millions 
of prototypes in a typical personal computer. The speed problems, on the other 
hand, can be approached from two (potentially complementary) points of view: 
trying to reduce the number of prototypes without degrading the classification 
power, or using a fast nearest neighbours search algorithm. 

The first approach has been widely studied in the literature, with techniques 
like editing 0, condensing LVQ DSM ^2], and their multiple 
variations being the best known representatives. These methods have shown 
good results, equaling or sometimes improving the classification rates of the k-nn 
rule. Their power resides in the smoother discrimination surfaces they yield, by 
eliminating not only the redundant prototypes, but also the dubious ones that 
appear to “get into other classes’ regions” and give rise to highly intricate 
separation surfaces. Smooth discrimination surfaces mean less risk of over-learning 
(good performance on the training set, or on a given validation set being used 
to guide the process or set the parameters, but poorer results on an unseen test 
set). In a pure k-nn classifier with a large database, this feature is provided by 
an adequate choice of k, which usually has to be made larger as the size of the 
database grows. 

The second approach, adopted in this work, has also been extensively studied 
in the literature. A number of methods exist to reduce the cost of an exhaustive 
search of the prototypes set to find the k nearest neighbours to a test point. A 
brief review of these techniques is presented in the next section. 



2 Fast Nearest Neighbour Search Methods 

The nearest neighbour search problem can be formulated in several distinct 
domains: from Euclidean vector spaces to (pseudo) metric spaces. Most algorithms 
intended for vector spaces are directly based on the construction of a data 
structure known as kd-tree nm, PI, HH|. A kd-tree is a binary tree where each node 
represents a region in a k-dimensional space. Each internal node also contains 
a hyperplane (a linear subspace of dimension k-1) dividing the region into two 
disjoint sub-regions, each inherited by one of its sons. Most of the trees used in 
the context of our problem divide the regions according to the points that lie 
in them. This way, the hierarchical partition of the space can either be carried 
through to the end to obtain, in the leaves, regions with a single point 
in them, or can be halted at an earlier level so that each leaf node holds b points 
in its region. In E3, a very illustrative general exposition of the methods based 
on kd-trees is presented, along with a general scheme that allows the reader to 
clearly identify the different existing variants and choose one (or a combination 
of them) according to the peculiarities of the problem. In [,'I1 can be found a 
thorough study of the cost of many kd-tree algorithms for different sample sizes, 
dimensionalities and values of b. 

Other algorithms, such as the one proposed by Fukunaga and Narendra El, 
perform a hierarchical partition of the space resorting to concepts different from those 
used in the kd-trees, for example, by means of a hierarchical clustering of the 
data. The search in the trees obtained by these methods is performed by applying 
concepts similar to the ones discussed for the classical kd-trees. Other data 
structures intended to improve on the kd-trees are for example the VP-trees m 
and the Geometric Near-neighbour Access Trees (GNATs) 0. The methods 
cited have been developed by researchers working in the fields of Algorithmics, 
Data Structures, Computational Geometry, Pattern Recognition, etc., and are 
oriented towards main memory storage of the data structure, but a large 
number of disk-oriented data structures have also been devised in the field of Spatial 
Databases, Geometric Queries in Databases, Image Search and Retrieval, etc., 
among them the K-D-B-Trees [25], the R-trees flj, R*-trees [2], X-trees 0, etc. 
Unfortunately, methods coming from these not-so-distant fields have not been 
compared or combined often. 

A problem closely related to the search in a set of prototypes of the nearest 
neighbour of a point is the search of a subset with the k nearest neighbours, for a 
given constant k. In classification applications, the k Nearest Neighbours Rule is 
a classical method which offers consistently good results, ease of use and certain 
theoretical properties related to the expected error. The extension of most of 
the referenced algorithms to this variation of the problem is simple, with a cost 
equivalent or inferior to k times the cost of the original algorithm. 



3 Approximate Nearest Neighbour Search 

In many cases, an absolute guarantee of finding the real nearest neighbour of 
the test point is not necessary. In this sense, a number of algorithms of appro- 
ximate nearest neighbour search have been proposed. These methods can also 
be regarded as sub-optimal algorithms for the original problem of exact nearest 
neighbour search P, j^, P, |23|, [Q]) pij . 

But why seek a suboptimal solution if so many sub-linear exact nearest 
neighbour search algorithms exist? The answer to this question comes from practical 
issues. For instance, the average costs of the most popular exact algorithms based 
on search trees have been analysed in the literature. According to that analysis, it is 
not correct to assume that these algorithms achieve logarithmic or lower average 
costs in many practical cases. That assumption is valid for low dimensionalities, but 
when the number of components of the points gets larger, the number of points 
necessary to keep the average cost in the same terms is often extremely large. 

In this work, two different approximate nearest neighbour algorithms have 
been tested. The first one is based on the classical kd-tree method. In a kd-tree, 
the search for the nearest neighbour of a test point starts from the 
root, which represents the whole space, and chooses at each node the sub-tree 
that represents the region of the space containing the test point. When a leaf 
is reached, an exhaustive search of the b prototypes residing in the associated 
region is performed. Unfortunately, the process is not complete at this point. If 
it were, the cost of the search would be logarithmic in the number 
of points and the technique would be definitive and extremely useful. However, 
it is perfectly possible that, among the regions defined by the initial 
partition, the one containing the test point is not the one containing the nearest 
prototype. It is easy to determine if this can happen in a given configuration, in 
which case the algorithm backtracks as many times as necessary until it is sure 
to have checked all the regions that can hold a prototype nearer to the test point 
than the nearest one in the original region. The resulting procedure can be seen 
as a Branch-and-Bound algorithm. 

If a guaranteed exact solution is not needed, the backtracking process can be 
aborted as soon as a certain criterion is met by the current best solution. In P, 
the concept of (1 + ε)-approximate nearest neighbour query is introduced, along 
with a new data structure, the BBD-tree. A point p is a (1 + ε)-approximate 
nearest neighbour of q if the distance from p to q is less than 1 + ε times the 
distance from q to its nearest neighbour. One of the splitting rules proposed for 
the BBD-tree, and the algorithm used to perform the (1 + ε)-approximate nearest 
neighbour queries, based on a priority search scheme, have been used on 
conventional kd-trees in the experiments, run using the implementation provided by 
D.M. Mount and S. Arya, available from http://www.cs.umd.edu/~mount/ANN. 
The parameters used were the default ones, i.e. sliding midpoint splitting and 
bucket size of 1 point. 
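The effect of ε on the backtracking step can be illustrated with a small kd-tree written from scratch. This is only a sketch under assumptions and not the implementation used in the experiments: it splits at the median instead of using the sliding midpoint rule, and it explores the tree depth-first instead of with a priority search. The names `build` and `search` are hypothetical. A far branch is skipped whenever every point in it must lie farther than the current best distance divided by (1 + ε), which keeps the returned distance within a factor (1 + ε) of the true one.

```python
import numpy as np

def build(points, indices=None, bucket=1, depth=0):
    """Median-split kd-tree; leaves hold at most `bucket` point indices."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) <= bucket:
        return ("leaf", indices)
    axis = depth % points.shape[1]
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    split = points[order[mid], axis]
    return ("node", axis, split,
            build(points, order[:mid], bucket, depth + 1),
            build(points, order[mid:], bucket, depth + 1))

def search(node, points, query, eps=0.0, best=(np.inf, -1)):
    """Return (distance, index) of a (1 + eps)-approximate nearest neighbour."""
    if node[0] == "leaf":
        for i in node[1]:
            d = np.linalg.norm(points[i] - query)
            if d < best[0]:
                best = (d, i)
        return best
    _, axis, split, left, right = node
    near, far = (left, right) if query[axis] < split else (right, left)
    best = search(near, points, query, eps, best)
    # Backtrack only if the far region could hold a point closer than
    # best / (1 + eps); with eps = 0 this is the exact branch-and-bound test.
    if abs(query[axis] - split) * (1.0 + eps) < best[0]:
        best = search(far, points, query, eps, best)
    return best

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(1000, 4))     # synthetic prototypes
tree = build(prototypes)
test_point = rng.normal(size=4)
exact = search(tree, prototypes, test_point, eps=0.0)
approx = search(tree, prototypes, test_point, eps=1.5)
print(exact, approx)
```

Larger values of ε prune more branches, which is the mechanism behind the shorter search times reported for ε = 1.5 and ε = 3.0.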

The second approximate nearest neighbour search algorithm tested is 
based on The Extended General Spacefilling Curves Heuristic m-. The method 
works by mapping each prototype (an n-dimensional point) several times into 
the one-dimensional Real Line through the application of a Spacefilling 
Mapping. Each mapping is preceded by a different set of rotations, normalization 
and transformations. The unidimensional values that correspond to each 
mapping (sub-model) are sorted and stored into a vector, inserted into a b-tree or 
inserted into an indexed table. When a test point p is presented to the system, 
it is mapped again into the Real Line with a different transformation for each 
of the r submodels. The b nearest neighbours of the unidimensional value in the 
Real Line, for each sub-model, can be readily found using a conventional search 
in the b-tree, in O(log N + b). The union of the r sets of b neighbours obtained 
from the r submodels produces a set (of size ≤ rb, typically much lower than that 
upper bound) which will be exhaustively searched to find the nearest neighbour 
of p in the original multidimensional space. The constant b can be considered 
loosely related (and intentionally homonymous) to the number of prototypes that 
is assigned to each leaf node in the models where a kd-tree does not partition 
the space completely. 
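The structure of this heuristic (r transformed one-dimensional orderings, b candidates from each, then an exhaustive search over their union) can be sketched as follows. The sketch rests on loud assumptions: a Z-order (Morton) key with 8 bits per coordinate stands in for the spacefilling mapping, each submodel uses a random rotation and shift, and a sorted list searched with `bisect` replaces the b-tree. The class name `CurveIndex` and all parameters are hypothetical.

```python
import bisect
import numpy as np

BITS = 8  # quantization bits per coordinate (assumption)

def morton_key(cell):
    """Interleave the bits of integer coordinates into one Z-order key."""
    key = 0
    for bit in range(BITS - 1, -1, -1):
        for c in cell:
            key = (key << 1) | ((int(c) >> bit) & 1)
    return key

class CurveIndex:
    def __init__(self, points, r=4, seed=0):
        rng = np.random.default_rng(seed)
        self.points = points
        dim = points.shape[1]
        self.lo = points.min(axis=0)
        self.scale = np.ptp(points, axis=0).max() or 1.0
        self.submodels = []
        for _ in range(r):
            rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
            shift = rng.uniform(0.0, 1.0, size=dim)
            keys = [morton_key(c) for c in self._cells(points, rotation, shift)]
            order = sorted(range(len(points)), key=keys.__getitem__)
            self.submodels.append(
                (rotation, shift, [keys[i] for i in order], order))

    def _cells(self, x, rotation, shift):
        """Rotate, shift and quantize points to integer grid cells."""
        unit = (np.atleast_2d(x) - self.lo) / self.scale    # within [0, 1]^dim
        moved = unit @ rotation + shift
        half = np.sqrt(x.shape[-1])          # bound on any rotated coordinate
        q = (moved + half) / (2.0 * half + 1.0)
        top = 2 ** BITS - 1
        return np.clip((q * top).astype(int), 0, top)

    def query(self, q, b=8):
        """Return (index of best candidate, number of candidates examined)."""
        candidates = set()
        for rotation, shift, keys, order in self.submodels:
            key = morton_key(self._cells(q, rotation, shift)[0])
            pos = bisect.bisect_left(keys, key)
            window = range(max(0, pos - b), min(len(keys), pos + b))
            nearest = sorted(window, key=lambda j: abs(keys[j] - key))[:b]
            candidates.update(order[j] for j in nearest)
        cand = np.fromiter(candidates, dtype=int)
        dists = np.linalg.norm(self.points[cand] - q, axis=1)
        return int(cand[dists.argmin()]), len(cand)

rng = np.random.default_rng(0)
prototypes = rng.uniform(size=(200, 3))     # synthetic prototypes
index = CurveIndex(prototypes, r=4)
nearest, n_candidates = index.query(prototypes[17], b=8)
print(nearest, n_candidates)
```

Only the candidate set, of size at most r·b, is compared against the test point in the original space, which is where the speed-up over exhaustive search comes from.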

4 Databases Used 

The well-known NIST Special Databases 3 and 7 have been used in all the 
experiments. These databases of isolated handwritten characters, composed of 
lower-case letters, upper-case letters, and digits, were used as training and test sets in 
the First Census Optical Character Recognition Systems Conference, sponsored 
in 1992 by the American National Institute of Standards and Technology (NIST). 
The purpose of that event was to determine the state of the art in off-line 
handwritten character recognition, and twenty-nine participants from North America 
and Europe took part in it. A training database (SD3) consisting of 223,125 
handwritten digits, 45,313 lower-case letters and 44,951 upper-case letters, with 
128x128 binary pixel images, segmented and labelled by hand, was delivered. 
Twenty days later, a test database (TD1, known today as SD7) composed of 
59,000 digits, 24,000 lower-case letters and 24,000 upper-case letters was sent to 
the participants, who had to return the results in the next 15 days. 

The conference participants and many researchers thereafter have used SD3 
and SD7 to test character recognition methods. An important conclusion obtai- 
ned from that experience is that SD7 is significantly harder than SD3 for digits 
and at least very different, if not harder, for upper and lower case letters. In fact, 
SD3 is often split into a training and a test set, and SD7 is taken as a second test 
set. In those cases, SD7 is sometimes referred to as “hard test” since its error 
rates are considerably larger. The reasons given for that behaviour are related 
to the different ways in which SD3 and SD7 were obtained. 

Although both databases were acquired by segmenting the characters filled 
out in boxes on forms, the forms for SD3 were completed by 2100 permanent 
Census field workers, who were probably very motivated and conscious of the 
importance of legible writing in the processing of large amounts of forms. SD7, on 
the other hand, was acquired from 500 high school students who were forced to 
fill out the forms in class. Additionally, SD7 forms were segmented by a different 
person than SD3 forms. 

Some of the participants in the conference used exclusively SD3 to train, but 
others used proprietary databases. Among the former, the best recognition results 
at zero-rejection-rate were 96.84% for digits, 96.26% for upper-case letters, and 
87.26% for lower-case letters. The methods used in the best system for digits 
and in most of the best placed systems were based on the k-nearest neighbours 
rule. 

In [13], SD3 was used to compare several classifiers on handwritten digits, 
namely Multilayer Perceptrons, Radial Basis Function Networks, Gaussian 
Parametric Classifiers, and k-Nearest Neighbours methods. 7480 digits randomly 
selected among the first 250 writers were used as training and 23140 digits 
extracted from the second 250 writers constituted the test set. The best results 
were again from the k-nearest neighbours methods. A perturbation 
method using neural networks has also been proposed, achieving an excellent 97.1% 
result on the SD7 digits hard test. The idea of perturbing (distorting) the characters 
has also been used in the experiments of this work. 

5 Experiments 

The preprocessing and feature extraction methods employed were very simple. 
In the first place, the character images were sub-sampled from their original 
128x128 binary pixels into 14x14 gray value representations by first computing 
the minimum inclusion box of each character, keeping the original aspect 
ratio, and then accumulating into each of the 14x14 divisions the area occupied 
by black pixels to obtain a continuous value between 0 and 1. Principal 
Components Analysis (or Karhunen-Loeve Transform) was then performed on the 
image representations to reduce their dimensionality to 45. The reduced resolution 
and final dimensionality values were chosen as a good compromise for all the 
types of characters tested after extensive experimentation. 
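The area-accumulation sub-sampling can be sketched as below. Several details are assumptions made for illustration: black pixels are coded as 1, the inclusion box is centred on a square canvas to keep the aspect ratio, and the image is assumed to contain at least one black pixel. The function names are hypothetical, and the PCA step is not shown.

```python
import numpy as np

def overlap_weights(size, bins):
    """weights[a, r] = length of pixel interval [r, r+1) that falls in bin a."""
    step = size / bins
    edges = np.arange(bins + 1) * step
    pix = np.arange(size)
    start = np.maximum(pix[None, :], edges[:-1, None])
    stop = np.minimum(pix[None, :] + 1, edges[1:, None])
    return np.clip(stop - start, 0.0, None)

def subsample(binary, bins=14):
    """Map a binary character image to a bins x bins image of black-area ratios."""
    rows, cols = np.nonzero(binary)
    box = binary[rows.min():rows.max() + 1,
                 cols.min():cols.max() + 1].astype(float)
    side = max(box.shape)                  # square canvas keeps the aspect ratio
    canvas = np.zeros((side, side))
    top = (side - box.shape[0]) // 2
    left = (side - box.shape[1]) // 2
    canvas[top:top + box.shape[0], left:left + box.shape[1]] = box
    w = overlap_weights(side, bins)
    # Black area inside each division, divided by the area of a division
    return w @ canvas @ w.T / (side / bins) ** 2

image = np.zeros((128, 128), dtype=np.uint8)
image[30:50, 60:65] = 1                    # synthetic 20 x 5 vertical stroke
features = subsample(image)
print(features.shape, features.min(), features.max())
```

The fractional weights make the accumulation exact even when the side of the inclusion box is not a multiple of 14.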

Usual slant normalization techniques were not found to improve the results, 
but the insertion of artificially slanted images into the training set produced 
significant improvements. The slant was applied to the original binary images and 
consisted of right or left-shifting each row an integer number of positions. The 
central row was never shifted, and the amount of shift increased linearly from 
there to the top and bottom rows, respectively. The new pixels entering the area 
after a shift were set to white. Morphological erosions and dilations on the 
original images were also introduced in the database as a preliminary test to find 
out if adding distorted versions of the training data can be useful. The results 
of these tests are presented later in this section. 
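A minimal sketch of the row-shifting slant follows. It assumes black pixels are coded as 1 and white as 0, that a positive `max_shift` moves the top rows to the right and the bottom rows to the left, and that the largest shift is smaller than the image width; the function name and the rounding rule are illustrative choices.

```python
import numpy as np

def slant(binary, max_shift):
    """Shift each row horizontally; the shift grows linearly from the centre row."""
    height, width = binary.shape
    centre = height // 2
    out = np.zeros_like(binary)            # vacated pixels stay white (0)
    for r in range(height):
        shift = int(round(max_shift * (centre - r) / centre))
        if shift > 0:                      # move the row to the right
            out[r, shift:] = binary[r, :width - shift]
        elif shift < 0:                    # move the row to the left
            out[r, :shift] = binary[r, -shift:]
        else:                              # central row is never shifted
            out[r] = binary[r]
    return out

stroke = np.zeros((5, 5), dtype=np.uint8)
stroke[:, 2] = 1                           # vertical stroke in the middle column
slanted = slant(stroke, max_shift=2)
print(slanted)
```

Calling the function with several positive and negative values of `max_shift` would produce the different slant angles added to the training set.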

A first set of experiments was performed using the first 200,193 digits from 
SD3 as a training set and the remaining 22,903 digits as test set (no writer 
was split by this setting). The results for the Spacefilling Curves model (SPFC) 
with one of the best combinations of r (number of submodels) and b (number 
of neighbours on the Real Line), namely r=15 and b=80, are shown in Table 1. 
With the same training and test sets, the kd-tree model performed better for 
a range of values of ε, as can also be seen in Table 1. These results are at zero 
rejection rate, for a number of neighbours k=4, which gave the best results in all 
the tests. In our experience, the SPFC method outperforms kd-trees only when 
the data points follow uniform distributions, which is not the case for most real 
pattern recognition tasks. 



Table 1. Results of handwritten digit recognition with a partition of the NIST SD3 
database for training and test. CPU times in a Pentium II - 450 MHz machine are shown. 

Method and Setting     Recog. Rate (%)     Search Time (ms/char) 
SPFC                        99.09                 6.44 
kd-tree ε = 0.5             99.21                12.66 
kd-tree ε = 1.5             99.21                 2.40 
kd-tree ε = 3.0             99.10                 0.65 

According to these first results, the rest of the experiments were focused on 
the kd-tree models. Similar tests with the first 38,678 upper-case letters from 
SD3 as training and the remaining 6,273 as test were performed, along with 
tests using the first 38,976 lower-case letters for training and the remaining 6,337 
for test. The results of these experiments are summarized in Table 2. In this 
case, the best numbers of neighbours were k=4 and k=6 respectively. 

Table 2. Results of upper and lower-case letters recognition with partitions of the 
NIST SD3 database for training and test. CPU times are in ms/character in a Pentium 
II - 450 MHz machine. 

        Upper-case letters (k=4)            Lower-case letters (k=6) 
  ε     Recog. Rate (%)   Search Time       Recog. Rate (%)   Search Time 
 0.0        94.64            20.99              89.81            20.64 
 0.5        94.64             8.05              89.81             8.12 
 1.5        94.64             1.87              89.74             1.86 
 3.0        94.28             0.57              89.49             0.54 

To obtain a quantitative indication of the improvements that can be expected 
from training sets of increasing size, a series of experiments was conducted with 
training sets from 20,089 to 200,193 digits. The results and recognition speeds 
are shown in Figure 1. The throughputs are measured on a Pentium II - 450 
MHz machine running UNIX (Linux 2.2.9) and do not include the preprocessing 
time of the test character. 

Given the slow increase of the search times incurred when the database grows, 
an interesting approach to improve the accuracy, keeping at the same time high 
recognition speeds, is to insert new prototypes into the training set. Of course, 
enlarging the original database is an evident way to do it, but the 
manual or semi-automated segmentation and labeling procedures needed to build 
a good, large database are very time-consuming. Therefore, a possible approach 
to exploit the information of a given database as much as possible is to 
perform controlled deformations on the characters and insert them into a new, larger 
training set. Similar approaches based on deformations of the data have been 
proposed in m with excellent results. Here, we propose as a faster and cleaner 
option to include the distorted characters in the training set instead of distorting 
the test character in several ways and carrying out the classification of each 
deformed pattern. We have tested this scheme using slanted versions of the original 
characters, as explained at the beginning of this section. The recognition rate 
using 4 slant angles to obtain a training set of 1,000,965 characters (including 
the 200,193 original ones) improved to 99.43%, from 99.21%, thus cutting the 
error rate by more than one fourth, in the test on SD3 digits, with k=4 and 
ε = 1.5. The search time increased from 2.4 ms/char. to 4.5 ms/char. 
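The trade-off quoted above can be checked with a few lines of arithmetic; the variable names are illustrative, and all figures are the ones given in the text.

```python
error_before = 100.0 - 99.21      # % error with the 200,193 original digits
error_after = 100.0 - 99.43       # % error with the slanted versions added

relative_reduction = (error_before - error_after) / error_before
time_ratio = 4.5 / 2.4            # search time after / before (ms/char)
size_ratio = 1_000_965 / 200_193  # training set size after / before

print(round(relative_reduction, 3), time_ratio, size_ratio)
```

The error rate falls by about 28% while the training set grows fivefold and the search time less than doubles.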

All the results presented so far have been obtained using only SD3. The same 
experiments have also been performed on SD7, taking the whole SD3 as training set. 
In Table 3 the results for both databases are summarized. A last experiment was 
conducted to test if other kinds of deformations could increase the recognition 
rates. Two additional sets of images obtained by eroding and dilating by one 
pixel the original (128x128) digit images were appended to the training set (which 
was then 4+2=6 times larger than the original one). The results for the “easy 
test” did not improve, and in the “hard test” the recognition rate increased by 
0.31%, reaching 96.59%. 

Fig. 1. Recognition rate at zero percent rejection (left), and number of searches per 
second (right) for two different values of k, and increasing training set sizes 
(horizontal axis of both panels: size of the training set). 

Table 3. Summary of recognition rates of digits and upper-case letters, with partitions 
of the NIST SD3 database (“easy test”), and with the whole SD3 for training and SD7 
for test (“hard test”). In some experiments the training set has been augmented with 
4 slanted versions of each character. The parameters used were k=4 and ε = 1.5. 

Training set / Test set             Rec. rate digits (%)    Rec. rate uppers. (%) 
SD3 / SD3 (easy test)                     99.21                   94.64 
SD3+slanted / SD3 (easy test)             99.43                   95.78 
SD3 / SD7 (hard test)                     95.16                   89.43 
SD3+slanted / SD7 (hard test)             96.28                   92.34 


6 Conclusions 



The experimental results obtained suggest that fast approximate k-nearest 
neighbours search can be a practical approach to handwritten optical character 
recognition. Previous results m indicating that the error rate is more than halved 
each time the database size is increased tenfold have been confirmed, and 
preliminary work on the idea of inserting distorted characters into the database has 
been shown to improve the accuracy significantly with moderate increases of the 
search times. 

Obviously, many potentially useful distortions are possible, and there is a 
practical limit on the number of prototypes in the database. Therefore a method 
to reduce its size without compromising the results should be found. Condensing 
methods are clear candidates (see section 1), and a simple way to reduce the 
number of distorted characters entering the database could be to test each one 
before inserting it, and discard points that do not convey discrimination power 
(those with k’ neighbours of the same class, for example). Keeping all the original 
prototypes, and inserting only artificial characters meeting a certain criterion, 
seems a safe and efficient tradeoff, which we plan to test in the immediate future. 

Abstract. We introduce a family of divergences between Φ-probabilistic 
sets, with real supports. The supports are never unbounded to opposite 
sides. We start from weighted and percentiled dissimilarities between 
arbitrary unions of compact intervals of real numbers. As an application 
we model the problem of the recognition of a handshape as a metric 
problem between Φ-probabilistic sets. The proposed family of divergences 
is a suitable solution to this problem of comparing one handshape 
prototype, a Φ-probabilistic set, with one input handshape, a Φ-fuzzy 
set. 



1 Introduction 

The purpose of this work is to introduce some horizontal divergences between 
Φ-probabilistic sets, horizontal because they are defined from the dissimilarities 
measured between their α-cuts. The vertical point of view is revised in ^ In 
we concentrate our efforts on the horizontal point of view. We assume the 
normality of the fuzzy sets to ensure the non-voidness of all their α-cuts. As each 
α-cut of a normal fuzzy set with real support is a compact interval in ℝ, 
the first thing we need is to measure the dissimilarity between compact intervals in ℝ. 
This is investigated in mi where we motivate and propose several dissimilarities, 
including one recursively defined from the dissimilarities measured between 
some of their subintervals. The following subsections comprise several examples of 
applications. The extension of the proposed divergence measures to normal but 
non-convex sets is nearly trivial, starting from the corresponding definition of a 
dissimilarity measure between finite unions of compact intervals in ℝ (cf. Hl'j.ltll . 
In we apply the former horizontal divergences (cf. between probability 
distributions. In >1,4.51 we apply the recursive schema (cf. >i4. Ill over fuzzy 
numbers. A brief discussion about a possible mixed divergence is presented in 
O Lastly, in we model the problem of the recognition of a handshape as a 
metric problem between Φ-probabilistic sets. The proposed family of divergences 
is a suitable solution to this problem of comparing one handshape prototype, a 
Φ-probabilistic set, with one input handshape, a fuzzy set. 

For later use we recall some definitions. Let $\mathcal{D}$ be a universal or
reference (crisp) set. A fuzzy set $A$ over $\mathcal{D}$ can be identified with
its membership function (mf); $A$ is always and only a function from
$\mathcal{D}$ into $[0,1]$. Given $\alpha \in [0,1]$, the $\alpha$-cut of $A$ is
the crisp set ${}^{\alpha}A = \{x : A(x) \ge \alpha\}$ and its level set is
$\Lambda(A) = \{\alpha \in [0,1] : A(x) = \alpha\}$. All the $\alpha$-cuts of a
given set $A$ form a decreasing sequence of nested crisp sets, i.e.,
$\forall \alpha_1, \alpha_2 \in [0,1]$,
$\alpha_1 \le \alpha_2 \Rightarrow {}^{\alpha_2}A \subseteq {}^{\alpha_1}A$.

Every fuzzy set can uniquely be represented by the family of all its
$\alpha$-cuts. This is usually referred to as a decomposition of the set [2]. In
order to do that, let us consider, for each $\alpha \in [0,1]$, the fuzzy set
${}_{\alpha}A = \alpha \cdot {}^{\alpha}A$, where ${}^{\alpha}A$ represents its
characteristic function (viewed as a special membership function). Every fuzzy
set $A$ is the standard fuzzy union of all the sets ${}_{\alpha}A$, varying
$\alpha$ in $\Lambda(A)$. The height of $A$ is
$h(A) = \sup\{A(x) : x \in \mathcal{D}\}$. The support of $A$ is the set
$\mathrm{supp}(A) = \{x \in \mathcal{D} : A(x) > 0\}$. We call core of $A$ its
1-cut, $\mathrm{core}(A) = \{x \in \mathcal{D} : A(x) = 1\}$. $A$ is called
normal if $\mathrm{core}(A) \ne \emptyset$ (i.e., $h(A) = 1$), and subnormal
otherwise. $A$ is open left if $\lim_{x \to -\infty} A(x) = 1$ and
$\lim_{x \to +\infty} A(x) = 0$; open right if $\lim_{x \to -\infty} A(x) = 0$
and $\lim_{x \to +\infty} A(x) = 1$; open if it is open left and right; and
closed if $\lim_{x \to -\infty} A(x) = \lim_{x \to +\infty} A(x) = 0$. $A$ is
convex if $\forall \alpha_1, \alpha_2 \in [0,1]$,
$\alpha_1 \ge \alpha_2 \Rightarrow {}^{\alpha_1}A \subseteq {}^{\alpha_2}A$. A
fuzzy number is every normal and convex fuzzy set in the real line $\mathbb{R}$.
At the syntactic level, probability density functions (pdf) may be considered as
fuzzy sets, exactly those whose cardinality is one.



2 Vertical Approach 

As a general procedure, we can measure the divergence between two real bounded
functions $f$ and $g$, from $\mathcal{D}$ into $[0,1]$, by measuring the local
dissimilarities at each point $x$ of the common domain $\mathcal{D}$,
$\delta(f(x), g(x))$, and then defining a divergence $D(f,g)$. To define this
$D$ from the local dissimilarities is to find a way of aggregating all that
local information. Three ways seem obvious: optimistically,
$D_{\inf}(f,g) = \inf\{\delta(f(x), g(x)) : x \in \mathcal{D}\}$,
pessimistically,
$D_{\sup}(f,g) = \sup\{\delta(f(x), g(x)) : x \in \mathcal{D}\}$, or an averaged
one, a kind of mean between those extremal cases,
$D_{\inf}(f,g) \le D_{av}(f,g) \le D_{\sup}(f,g)$.

This $D_{av}(f,g)$ could be defined from a $\varphi$-mean format. Given a set of
$n$ values $X = \{x_1, \ldots, x_n\}$, if $\varphi$ is a continuous and monotonic
function in $[\inf X, \sup X]$, the $\varphi$-mean of $x_1, \ldots, x_n$ is
defined as the value $M_\varphi$ such that

    $$\varphi(M_\varphi) = \sum_{i=1}^{n} c_i\, \varphi(x_i) \qquad (1)$$

Obviously, $\inf X \le M_\varphi \le \sup X$, and actually $M_\varphi$, from
$[0,1]^{|\mathcal{D}|}$ into $[0,1]$, is a monotonic increasing function in all
its arguments. If $c_i = 1/|\mathcal{D}|$ for all $i$, then we are looking for
the solution $M_\varphi\{\cdot\}$ of

    $$\varphi\big(M_\varphi\{\delta(f(x), g(x)) : x \in \mathcal{D}\}\big) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \varphi\big(\delta(f(x), g(x))\big) \qquad (2)$$

If $\varphi(x) = x^\beta$, $M_\varphi$ is known as a generalized mean, which we
denote $a_\beta$,

    $$\big(a_\beta\{\delta(f(x), g(x)) : x \in \mathcal{D}\}\big)^\beta = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \delta(f(x), g(x))^\beta \qquad (3)$$

and such that, in the countably infinite case, we should be in the space of
summable series of power $\beta$, and if $\beta \to \infty$,
$a_\beta = \sup X$. If $\delta(x,y) = |x - y|$ we recognize the standardized
family of discrete Minkowski distances, for $1 \le p < \infty$, with
$d_\infty(f,g) = \lim_{p \to \infty} d_p(f,g) = \sup\{|f(x) - g(x)| : x \in \mathcal{D}\}$.
If $\mathcal{D} = [a,b]$ and $f$ and $g$ are continuous in $\mathcal{D}$, then we
can 'add up' all the local dissimilarities,

    $$\big(d_p(f,g)\big)^p = \int_a^b |f(x) - g(x)|^p\, dx \qquad (4)$$

where, if $p = 1$, the integral represents the total area between the curves.

3 Horizontal Approach 

Dubois and Prade proposed a fuzzy-valued comparison index between fuzzy
sets,

    $$IC(A,B) = \int \alpha \,/\, IC({}^{\alpha}A, {}^{\alpha}B) \qquad (5)$$

defined from the values of the comparison index acting on their $\alpha$-cuts.
This index assumes that the comparison between two fuzzy sets at high membership
degrees needs to weigh more than that at lower ones. The proposed divergences
can take this fact into account. We assume the normality of the fuzzy sets to
ensure the non-voidness of all their $\alpha$-cuts. As each $\alpha$-cut of a
normal fuzzy set with real support is an interval, the first thing we need is to
measure the dissimilarity between intervals.



3.1 Dissimilarity Measures between Intervals of Real Numbers 

Let $I = [i_0, i_1]$ and $J = [j_0, j_1]$ be two compact intervals of real
numbers. One bijection that transforms $I$ into $J$ is
$f(x) = ((j_0 - j_1)x + i_0 j_1 - i_1 j_0)/(i_0 - i_1)$. Observe that $f$ is a
composition of a homothety and a translation, with ratios
$(j_0 - j_1)/(i_0 - i_1)$ and $(i_0 j_1 - i_1 j_0)/(i_0 - i_1)$, respectively.
Without any loss of generality, we can assume that $I$ and $J$ are subintervals
of $[0,1]$. This is because, given a concrete working environment $W$, there
exists a real number $k$ such that all the possible intervals to compare are
subintervals of $[-k,k]$. Thus we can standardize by range,
$1/\sup_W\{\delta(i_r, j_s)\}$, with $r,s \in \{0,1\}$, or this $k$ can be
generalized to be an inaccessible real number (actual infinity), a real number
too large to be described in any floating point number system to be implemented
on a computer. Inaccessible real numbers of the same sign are incomparable,
since we cannot know which real number each of them represents. Then, without
any loss of generality, we can work only with subintervals of $[0,1]$, because
by $f$ we transform $[-k,k]$ into $[0,1]$.

$\alpha$-Percentage Points. If $X = [x_0, x_1]$ is a compact interval of real
numbers, we define the $\alpha$-percentage point in $X$ as
$x_{(\alpha)} = (1 - \alpha)x_0 + \alpha x_1$, with $\alpha \in [0,1]$.
Although the percentage points may be unequally spaced, we assume, first, that
the difference between consecutive percentage points is always the same value,
that is, the percentage points are equally spaced. More general situations will
be treated further.

One can think of comparing $I$ and $J$ 'to the right', measuring the distance
between $i_0$ and $j_0$, but then the dissimilarity will not be sensitive to
different right tails, although it should be the right procedure if the interval
space is $\{[x,1) : x \in (0,1)\}$ or if it is lateralized, i.e., if it is
$\{[a,x) : x \in (a,1)\}$, with fixed $a$, measuring a certain 'adequacy towards
$a$'. The situation is similar if comparing 'to the left', between $i_1$ and
$j_1$. A first and usual way to mitigate those disadvantages is to use the
following extension to intervals of the family of standardized Minkowski
metrics,

    $$\bar{d}_p^{(2)}(I,J) = \left( \frac{|i_0 - j_0|^p + |i_1 - j_1|^p}{2} \right)^{1/p} \qquad (6)$$

for $1 \le p < \infty$, and
$d_\infty(I,J) = \max\{|i_0 - j_0|, |i_1 - j_1|\}$. Given $I$ and $J$, if
$|i_0 - j_0| = |i_1 - j_1|$, i.e., if the interval transformer $f$, defined
above, is a translation, then $\bar{d}_p^{(2)}(I,J) = |i_0 - j_0|$ for all
$1 \le p \le \infty$. There exist three completely different situations for
which $\bar{d}_p^{(2)}(I,J)$ is the same. First, if the interval transformer
$f$, defined above, is 'only' a translation; second, if $f$ is a contraction;
and third, if $f$ is a lateral dilation, $I$ and $J$ being concentric in the
second case. As the metrics $\bar{d}_p^{(2)}$ only consider local dissimilarities
at the endpoints, they assign the same distance value to all the pairs of
intervals with equal endpoint differences. But this does not correspond to what
our natural intuition would say. For example, if $I$ represents a range request,
and $J$ and $K$ are such that $J \subset I$ and $K \not\subset I$, then
$d(I,J)$ should be lower than $d(I,K)$. This is the case if $I = [0,3]$,
$J = [1,2]$ and $K = [1,4]$, but
$\bar{d}_p^{(2)}(I,J) = \bar{d}_p^{(2)}(I,K) = 1$. A first idea to correct it is
to consider also the dissimilarities between the mean points of the intervals,

    $$\bar{d}_p^{(3)}(I,J) = \left( \frac{|i_0 - j_0|^p + |i_{1/2} - j_{1/2}|^p + |i_1 - j_1|^p}{3} \right)^{1/p} \qquad (7)$$

One obvious reason is that, as $I$ and $J$ are concentric
($i_{1/2} = j_{1/2}$), we divide by a greater quantity, and then
$\bar{d}_p^{(3)}(I,J)$ is lower. For the example above,
$\bar{d}_1^{(3)}(I,J) = 2/3 < \bar{d}_1^{(3)}(I,K) = \bar{d}_1^{(3)}(J,K) = 1$
and
$\bar{d}_2^{(3)}(I,J) = 0.8165 < \bar{d}_2^{(3)}(I,K) = 1 < \bar{d}_2^{(3)}(J,K) = 1.291$.
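
The worked example with $I = [0,3]$, $J = [1,2]$ and $K = [1,4]$ can be checked numerically. The sketch below assumes that the standardized Minkowski metrics average the $p$-th powers of the differences between corresponding $\alpha$-percentage points (endpoints only, or endpoints plus midpoint); the function name `minkowski_points` is illustrative and not taken from the paper.

```python
def minkowski_points(I, J, alphas, p):
    """Standardized Minkowski dissimilarity over the given percentage points.

    Assumption: the p-th powers of the pointwise differences are averaged
    (divided by the number of points) before taking the p-th root.
    """
    total = 0.0
    for a in alphas:
        i_a = (1 - a) * I[0] + a * I[1]  # alpha-percentage point of I
        j_a = (1 - a) * J[0] + a * J[1]  # alpha-percentage point of J
        total += abs(i_a - j_a) ** p
    return (total / len(alphas)) ** (1.0 / p)


I, J, K = (0.0, 3.0), (1.0, 2.0), (1.0, 4.0)
ENDS = (0.0, 1.0)           # endpoints only
ENDS_MID = (0.0, 0.5, 1.0)  # endpoints plus mean point

for p in (1, 2):
    print("p =", p,
          "endpoints:", minkowski_points(I, J, ENDS, p),
          minkowski_points(I, K, ENDS, p),
          "with midpoint:", minkowski_points(I, J, ENDS_MID, p),
          minkowski_points(I, K, ENDS_MID, p),
          minkowski_points(J, K, ENDS_MID, p))
```

With endpoints only, the pairs $(I,J)$ and $(I,K)$ are indistinguishable; adding the mean point makes the contained interval $J$ closer to $I$ than the overlapping interval $K$.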



Modified Minkowski Dissimilarities. In general, we reinforce the former
idea with the extra information coming from all the local dissimilarities
between the corresponding $\alpha$-percentage points. A first family of modified
Minkowski dissimilarities is defined as

    $$d_{(p)}(I,J) = \left( \int_0^1 \left| \Delta_0^{(I,J)} + \alpha \left( \Delta_1^{(I,J)} - \Delta_0^{(I,J)} \right) \right|^p d\alpha \right)^{1/p} \qquad (8)$$

where $1 \le p < \infty$ and $\Delta_k^{(I,J)} = i_k - j_k$
($k \in \{0,1\}$). The case $p = 2$ is widely studied by Bertoluzza, Corral and
Salas. If $p = \infty$,

    $$d_{(\infty)}(I,J) = \sup_{\alpha \in [0,1]} \{ |i_{(\alpha)} - j_{(\alpha)}| \} \qquad (9)$$

With respect to the former observations, if $f$ is a translation, the
dissimilarities are the same, but in the other two cases, the dissimilarities
$d_{(p)}(I,J)$ are monotonic increasing with respect to $p$, towards
$d_{(\infty)}(I,J)$.
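
As a rough numerical illustration of this family, the sketch below approximates $\left(\int_0^1 |\Delta_0 + \alpha(\Delta_1 - \Delta_0)|^p\, d\alpha\right)^{1/p}$ with a midpoint rule, where $\Delta_k = i_k - j_k$. The integration scheme, the step count and the function names are assumptions made for the example only.

```python
def modified_minkowski(I, J, p, steps=10000):
    """Midpoint-rule approximation of the integral form of d_(p)(I, J)."""
    d0 = I[0] - J[0]  # Delta_0: difference of the left endpoints
    d1 = I[1] - J[1]  # Delta_1: difference of the right endpoints
    h = 1.0 / steps
    total = 0.0
    for n in range(steps):
        a = (n + 0.5) * h  # midpoint of the n-th subinterval of [0, 1]
        total += abs(d0 + a * (d1 - d0)) ** p
    return (total * h) ** (1.0 / p)


def sup_dissimilarity(I, J):
    """p = infinity case: |Delta_0 + a(Delta_1 - Delta_0)| is convex in a,
    so its supremum over [0, 1] is reached at one of the endpoints."""
    return max(abs(I[0] - J[0]), abs(I[1] - J[1]))


I, J, K = (0.0, 3.0), (1.0, 2.0), (1.0, 4.0)
for p in (1, 2, 8):
    print(p, modified_minkowski(I, J, p), modified_minkowski(I, K, p))
print("sup:", sup_dissimilarity(I, J), sup_dissimilarity(I, K))
```

For the translation $I \to K$ the value stays at 1 for every $p$, while for the concentric contraction $I \to J$ it grows with $p$ towards the supremum.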



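Weighted Dissimilarities. A weighted dissimilarity between the intervals could
also be defined from a $\varphi$-mean format (cf. eqn. (1)). In the finite case
we define it as the solution $d_\varphi^w(I,J)$ of

    $$\varphi\big(d_\varphi^w(I,J)\big) = \sum_{\alpha=1}^{N} w(\alpha)\, \varphi\big(|i_{(\alpha)} - j_{(\alpha)}|\big) \qquad (10)$$

with $\sum_{\alpha=1}^{N} w(\alpha) = 1$. If $\varphi(x) = x^p$, we obtain
another modified Minkowski dissimilarity, that is, if $1 \le p < \infty$,

    $$d_{(p)}^w(I,J) = \left( \sum_{\alpha=1}^{N} w(\alpha)\, |i_{(\alpha)} - j_{(\alpha)}|^p \right)^{1/p} \qquad (11)$$

In the continuous case, given $I$ and $J$, if $w$ is a normalized Lebesgue
weight measure on $([0,1], \mathcal{B}([0,1]))$, and if
$\varphi\big(|\Delta_0^{(I,J)} + \alpha(\Delta_1^{(I,J)} - \Delta_0^{(I,J)})|\big)$
is $w$-integrable, we define a dissimilarity between $I$ and $J$ as the solution
$d_\varphi^w(I,J)$ of the integral equation

    $$\varphi\big(d_\varphi^w(I,J)\big) = \int_0^1 \varphi\left( \left| \Delta_0^{(I,J)} + \alpha \left( \Delta_1^{(I,J)} - \Delta_0^{(I,J)} \right) \right| \right) dw(\alpha) \qquad (12)$$

with $\Delta_k^{(I,J)} = i_k - j_k$ ($k \in \{0,1\}$), provided the integral
exists and $\varphi$ is continuous and monotonic on its domain. This last
expression depends on the definition of $w$, and there exists no general
solution to the proposed integral. If $\zeta(\alpha)$ is the Radon-Nikodym
derivative of the measure $w$, that is, $\zeta(\alpha) = dw(\alpha)/d\alpha$,
then

    $$\varphi\big(d_\varphi^w(I,J)\big) = \int_0^1 \varphi\left( \left| \Delta_0^{(I,J)} + \alpha \left( \Delta_1^{(I,J)} - \Delta_0^{(I,J)} \right) \right| \right) \zeta(\alpha)\, d\alpha \qquad (13)$$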



$(\alpha, \beta)$-Percentage Intervals. If $X = [x_0, x_1]$ is a compact
interval of real numbers, we define the $(\alpha, \beta)$-percentage interval of
$X$ as $X_{(\alpha,\beta)} = [x_{(\alpha)}, x_{(1-\beta)}]$. Then we consider the
following free-form approximation of $\zeta$ in $[0,1]$,

    $$\zeta(\alpha) = \sum_{k=1}^{M} c_k\, \chi_{I_k}(\alpha) \qquad (14)$$

where $\Gamma_{[0,1]} = \{I_1, \ldots, I_M\}$ is a finite collection of disjoint
$(\alpha, \beta)$-percentage subintervals of $[0,1]$, assuming that there are
$r$ and $s$ such that $1 \le r, s \le M$, $i_{r,0} = 0$ and $i_{s,1} = 1$.
$\chi_I$ is the characteristic function of $I$, whose area, with respect to a
measure $\mu$, is $\mu(I)$, for every interval that contains $I$. The special
case $\chi_{[a,a]}$ is $\delta(x - a)$, the Dirac $\delta$-distribution. Observe
that a normalization

    $$\sum_{k=1}^{M} c_k\, \mu(I_k) = 1 \qquad (15)$$

is necessary to include the Minkowski metrics $d_p$ as special cases.

Thus, we measure the dissimilarity between $I$ and $J$ from the dissimilarities
between some of their $(\alpha, \beta)$-percentage subintervals, proceeding with
these in a similar manner, and so on. We then have a recursive schema






y 



{d^l {S, 



r—1 



I qJ 
1 



(16) 



for k, from 1 to iVy+i, assuming that = 1- The end of the recursion 

is reached in P-|- 1 steps, assumed that the computation of d'^^ {Spr^ ^Pr)^ loi' 
all r = 1, . . . ,Np, is defined from its a-percentage points, i. e., from I(a,i-a)j 
as in eqn. dnu .The functions ip and ipy are continuous and monotonic at the 
corresponding intervals. 

For example, we can modify eqn. 0 in the following way: let be / = [foDi], 
J = bo,ji], <5^0 = [*oDo + e/,o], = [ii — S(q = [jo, jo + 

S(i = [ji — £j,i, ji], then a possible dissimilarity is defined as 



dp\l,J) = 



((df 



id^:\si,,SU)y) ^ (17) 



Observe that when we consider a finite collection
$\Gamma_I = \{I_1, \ldots, I_M\}$ of $(\alpha, \beta)$-percentage subintervals of
an interval $I = [i_0, i_1]$, we assume that there exist $r$ and $s$ such that
$1 \le r, s \le M$, $i_{r,0} = i_0$ and $i_{s,1} = i_1$.

3.2 The Horizontal Divergence

In order to define a divergence between two sets $A$ and $B$, we have only to
aggregate the local dissimilarities obtained in the former sections. Using a
$\varphi$-mean format (cf. eqn. (1)), if $A$ and $B$ are (closed) fuzzy numbers
and $d_\alpha(A,B)$ is a dissimilarity measure between the $\alpha$-cuts of $A$
and $B$, we propose as divergence the possible solution $D_{\varphi,\phi}(A,B)$
of

    $$\varphi\big(D_{\varphi,\phi}(A,B)\big) = \int_0^1 \varphi\big(d_\alpha(A,B)\big)\, d\phi(\alpha) \qquad (18)$$

where $\phi$ is a Lebesgue measurable function, $\varphi(d_\alpha(A,B))$ is
$\phi$-integrable and the integral above exists.

3.3 Horizontal Divergence between ‘Climbing’ Fuzzy Sets 

An instance of non-convex but normal fuzzy sets is what we call a climbing fuzzy
set, a normal fuzzy set such that all the reference points where it reaches a
local maximum belong to its core. Because of the normality, all the
$\alpha$-cuts of $A$ are non-void. If $A$ has real support, each of its
$\alpha$-cuts is a finite union of disjoint compact intervals of real numbers,
$X = I_1 \cup I_2 \cup \ldots \cup I_n = [i_{1,0}, i_{1,1}] \cup [i_{2,0}, i_{2,1}] \cup \ldots \cup [i_{n,0}, i_{n,1}]$.
We can easily extend the dissimilarities defined above to this case. For
example, the definition of the $\alpha$-percentage point in $X$ is
$x_{(\alpha)} = i_{k,0} + \alpha L - S_{k-1}$, where
$S_{k-1}/L \le \alpha \le S_k/L$, and $L_h = i_{h,1} - i_{h,0}$, $S_0 = 0$,
$S_k = L_1 + L_2 + \ldots + L_k$ and $L = S_n$.
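
The $\alpha$-percentage point of a finite union of disjoint compact intervals can be sketched as follows: walk along the intervals until the accumulated length $S_k$ covers $\alpha L$, then offset into the $k$-th interval by $\alpha L - S_{k-1}$. The function name and the convention of returning the first matching interval at a shared boundary value of $\alpha$ are assumptions for the illustration.

```python
def percentage_point(intervals, alpha):
    """alpha-percentage point of a union of disjoint compact intervals.

    `intervals` is a list of (lo, hi) pairs sorted from left to right.
    """
    lengths = [hi - lo for lo, hi in intervals]  # L_h
    total = sum(lengths)                         # L = S_n
    target = alpha * total                       # alpha * L
    covered = 0.0                                # S_{k-1}
    for (lo, hi), length in zip(intervals, lengths):
        if target <= covered + length:           # alpha * L <= S_k
            return lo + target - covered         # i_{k,0} + alpha*L - S_{k-1}
        covered += length
    return intervals[-1][1]  # guard against rounding when alpha is 1


X = [(0.0, 1.0), (2.0, 4.0)]  # total length L = 3
for alpha in (0.0, 0.25, 0.5, 1.0):
    print(alpha, percentage_point(X, alpha))
```

With a single interval the function reduces to $(1 - \alpha)x_0 + \alpha x_1$, so the interval dissimilarities built on percentage points carry over unchanged.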

3.4 Horizontal Divergence between Probability Distributions 

As a probability distribution is a special case of open right fuzzy set, with
height 1, we can define a horizontal divergence between two probability
distributions $P$ and $Q$ from the former dissimilarity measure between their
$\alpha$-cuts. If $\arg P(\alpha)$ and $\arg Q(\alpha)$ are real numbers, then
the $\alpha$-cuts are ${}^{\alpha}P = [\arg P(\alpha), \infty)$ and
${}^{\alpha}Q = [\arg Q(\alpha), \infty)$. If we denote
$M_{\arg}(P,Q;\alpha) = \max(\arg P(\alpha), \arg Q(\alpha))$ and
$m_{\arg}(P,Q;\alpha) = \min(\arg P(\alpha), \arg Q(\alpha))$, then the
horizontal divergence between $P$ and $Q$ at level $\alpha$ is
$d_\alpha(P,Q) = M_{\arg}(P,Q;\alpha) - m_{\arg}(P,Q;\alpha)$. If
$\arg P(\alpha)$ is an interval of real numbers, then
${}^{\alpha}P = [\min \arg P(\alpha), \infty)$, and so in order to compute
$d_\alpha(P,Q)$ it is necessary to use a dissimilarity between intervals of real
numbers.

A divergence can be defined in a similar way to compare two open left fuzzy
sets. In the case of an open fuzzy set, we assume that its support is not
$\mathbb{R}$ (any $\alpha$-cut is non-void), and then
${}^{\alpha}A = (-\infty, \min \arg A(\alpha)] \cup [\max \arg A(\alpha), +\infty)$.

3.5 Horizontal Divergence between $\Phi$-Fuzzy Numbers

Another example of interest refers to $\Phi$-fuzzy sets. In order to express a
greater uncertainty, Sambuc proposed the concept of $\Phi$-fuzzy set (or
interval-valued fuzzy set), whose membership function assigns to each
$x \in \mathcal{D}$ the value
$A(x) = [\underline{A}(x), \overline{A}(x)]$, an interval defined by a lower and
an upper membership function. We say that a $\Phi$-fuzzy set $A$ is a
$\Phi$-fuzzy number if $\overline{A}$ is a fuzzy number. If instead of single
intervals we consider a finite union of them, it is known as P-fuzzy set. For
example, denoting $f_{(0)}(x) = \min f(x)$ and $f_{(1)}(x) = \max f(x)$, if $A$
is a $\Phi$-fuzzy number, then its $\alpha$-cut is a union of three compact
intervals
${}^{\alpha}A = [\arg_{(0)} \overline{A}(x), \arg_{(0)} \underline{A}(x)] \cup [\arg_{(0)} \underline{A}(x), \arg_{(1)} \underline{A}(x)] \cup [\arg_{(1)} \underline{A}(x), \arg_{(1)} \overline{A}(x)]$.
We can use the former recursive schema to define a divergence between two
$\Phi$-fuzzy numbers, weighting the middle subinterval
$[\arg_{(0)} \underline{A}(x), \arg_{(1)} \underline{A}(x)]$ (the least
uncertain one) more than the ending intervals.

4 Mixed Approach 

In the general problem of comparing two fuzzy sets, several situations can be
considered. If they have a common support, or the intersection between both
supports is large, perhaps the most natural procedure is to measure the
dissimilarity between them vertically, but if they have disjoint supports, doing
it horizontally seems to be more natural. If the intersection of the supports is
not so large, perhaps we should consider a mixed approach. In such cases, we
must aggregate the horizontal and the vertical divergences. For example, if we
are interested in comparing two $\Phi$-fuzzy numbers, we can do it either
horizontally or vertically, using some interval dissimilarity. Observe that any
vertical approach only considers intervals in $[0,1]$.



5 Application to Handshape Recognition 

Our objective here is to model the problem of the recognition of a handshape as
a metric problem between $\Phi$-probabilistic sets, i.e., $\Phi$-fuzzy sets such
that $\underline{A}$ and $\overline{A}$ are probability functions [2], as part of
a wider study of Spanish Sign Language. The data acquisition is made with two
devices, a sensored glove CyberGlove(TM) (which measures flexion and abduction
angles, thumb rotation, palm arch, wrist pitch and wrist yaw) and a 3D sensor
Polhemus Isotrack(TM).

We call the observed sensor values the clues. In order to ease the exposition,
let us assume that the only sensor devices are those of flexion and that the
angles are normalized into $[0,1]$, where 1 means the quality 'completely
flexed'. Then each clue refers to this quality. Thus, our reference set is the
finite set of predicates $\mathcal{D} = \{s_1 = MPJ(t)$, $s_2 = IJ(t)$,
$s_3 = MPJ(i)$, $s_4 = PIJ(i)$, $s_5 = MPJ(m)$, $s_6 = PIJ(m)$, $s_7 = MPJ(r)$,
$s_8 = PIJ(r)$, $s_9 = MPJ(p)$, $s_{10} = PIJ(p)\}$, where the constants are
t=thumb, i=index, m=middle, r=ring, and p=pinkie, and the functions are
MPJ=MetacarpoPhalangeal Joint, IJ=Interphalangeal Joint, and P=Proximal. The
predicates mean, for instance, $s_7 = MPJ(r)$ = 'ring's MPJ angle is completely
flexed'. For example, a naive approximation to the ASL (American Sign Language)
handshape 'i' (pinkie extended and the rest completely flexed) is the fuzzy set,
in Zadeh's notation, $\sum_{i=1}^{10} a_i / s_i$, where $a_9$ and $a_{10}$ are
approximately 0, and the others are approximately 1. Observe that because of the
anatomical configuration of the human hand and the characteristics of 'i' to be
recognized, although flexed, it is enough that the thumb is close to the index
and the thumb-tip is below the knuckle line.

In general, the handshapes of a sign language need not be so precise. Fingers do
not have to achieve an exact position but an interval of possible positions. For
example, when signing 'w' in ASL, index, middle and ring are extended, but the
angle between the possible directions (viewing the fingers as vectors from the
knuckles) need not be exactly zero: differences of about 10-15 or even 20
degrees are often interpreted as 'w'. Thus, a fuzzy set seems to be a suitable
representation of the uncertainty for a handshape.

Assume that, in general, we have three finite sets $Q$, $S$ and $\mathcal{E}$,
of handshapes, sensor devices and expert signers, respectively. Given a
handshape $q$ to be learned by the system, for every expert
$e \in \mathcal{E}$, $q$ is defined as a set of $|S|$ probability distributions
$D_{q,e} = \{F_{q,e,s} : s \in S\}$. Once $q$ has been performed by all of the
experts, $q$ is defined as an overset $\{D_{q,e} : e \in \mathcal{E}\}$ of
subsets of probability distributions. Observe that if we suppose a finite range
for each sensor device (a quantization of $[0,1]$), the distributions in
$D_{q,e}$ are discrete.

Observe also that given an expert $e$ and a sensor device $s$, we have a
probability mass function $p_{e,s}$. We have assigned, heuristically, to each
expert $e$ a confidence level $\alpha_e \in [0,1]$, so the
$\alpha_e \cdot 100\%$ most probable executions of $q$ are the only ones
accepted (we call the set that comprises all of them, given $e$ and $s$, the
significance $\alpha_e$-cut of $p_{e,s}$, and we denote it
${}^{\alpha_e}p_{e,s}$). Thus, it seems reasonable to estimate the handshape
prototype signed by a given expert $e$ as the $\Phi$-fuzzy set

    $$q_e = \sum_{s \in S} [\underline{s}(e), \overline{s}(e)] / s$$

where the sample range of the sensor device $s$, given that the expert $e$ has
signed the handshape $q$, is estimated as
$[\underline{s}(e), \overline{s}(e)]$, with
$\underline{s}(e) = \min\{s_i : i = 1, \ldots, n(e,q) \wedge s_i \in {}^{\alpha_e}p_{e,s}\}$
and
$\overline{s}(e) = \max\{s_i : i = 1, \ldots, n(e,q) \wedge s_i \in {}^{\alpha_e}p_{e,s}\}$,
and $n(e,q)$ is the number of times that $e$ has signed $q$ (the sample size for
$e$ and $q$). Thus, given a set of experts $\mathcal{E}$, once all the
handshapes have been performed by all of them, we have $|Q|$ classes of
handshapes, all of them with $|\mathcal{E}|$ elements, the former $\Phi$-fuzzy
sets $q_e$.
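
The estimation of one sensor's sample range for one expert can be sketched as follows. The text does not fix how the $\alpha_e \cdot 100\%$ most probable executions are selected, so this example assumes one plausible reading: the quantized readings are ranked by empirical frequency and kept until their accumulated probability mass reaches $\alpha_e$. The function names, the tie-breaking rule and the sample data are hypothetical.

```python
from collections import Counter


def significance_cut(samples, alpha_e):
    """Most probable quantized readings whose empirical mass reaches alpha_e.

    Assumption: readings are ranked by decreasing frequency (ties broken by
    value) and accumulated until their total probability is at least alpha_e.
    """
    counts = Counter(samples)
    n = len(samples)
    ranked = sorted(counts.items(), key=lambda vc: (-vc[1], vc[0]))
    kept, mass = [], 0.0
    for value, count in ranked:
        if kept and mass >= alpha_e:
            break
        kept.append(value)
        mass += count / n
    return kept


def sample_range(samples, alpha_e):
    """Estimated interval [min, max] over the significance alpha_e-cut."""
    kept = significance_cut(samples, alpha_e)
    return min(kept), max(kept)


# Hypothetical flexion readings of one sensor for one expert and handshape.
readings = [0.9, 0.9, 0.9, 0.8, 0.8, 1.0, 0.2]
print(sample_range(readings, 0.7))  # the outlying readings are discarded
print(sample_range(readings, 1.0))  # every reading is accepted
```

A lower confidence level trims rarely observed readings, which narrows the interval stored in the prototype for that sensor.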

Each execution of a handshape by a signer (the input pattern to the system) is a
plain fuzzy set of observed values $\{o(s) : s \in S\}$ from the sensor devices,
although really, because of possible errors in the measurement process, we are
only sure that the true value belongs to some interval
$[o(s) - \Delta_s, o(s) + \Delta_s]$, where $\Delta_s$ is an estimation of the
error associated to $s$. Then, each input pattern could be represented as a
$\Phi$-fuzzy set.

At this point, the recorded patterns and the input one are represented by
$\Phi$-fuzzy sets. The problem of the classification of the input can be solved
by classical techniques such as 'nearest neighbours', evaluating dissimilarities
between the input and all the patterns in a class, for the different classes. We
can use, for example, the former recursive proposal as in §3.1.

5.1 Handshapes as $\Phi$-Probabilistic Sets

For each class of recorded patterns ($\Phi$-fuzzy sets) we define a prototype as
a $\Phi$-probabilistic set. Given a probabilistic space
$(\Omega, \mathcal{B}, P)$ and $(\Omega_c, \mathcal{B}_c)$ a space of
characteristics, we can define a $\Phi$-probabilistic set $A$ by a pair of
mappings
$\underline{A}, \overline{A} : \mathcal{D} \times \Omega \to \Omega_c$, where
$\underline{A}(x, \cdot)$ and $\overline{A}(x, \cdot)$ are measurable for all
$x \in \mathcal{D}$.

Given $q$, we define a prototype as the $\Phi$-probabilistic set
$\underline{q}, \overline{q} : S \times \Omega \to \Omega_c$, with
$\Omega = \{\{[\underline{s}(e), \overline{s}(e)] : s \in S\} : e \in \mathcal{E}\}$
and $\Omega_c = [0,1]$. Thus, the problem of the recognition of a handshape is
modeled as a metric problem between $\Phi$-probabilistic sets. The family of
divergences defined above is a suitable solution to the problem of comparing one
handshape prototype, a $\Phi$-probabilistic set, with one input handshape, a
$\Phi$-fuzzy set. A $\Phi$-fuzzy set is a $\Phi$-probabilistic set such that all
lower and upper probabilities are Dirac deltas. Given two $\Phi$-probabilistic
sets $A$ and $B$, we can calculate a divergence between them in several ways.
For example, we can compute

    $$O_{|S|}\big\{ O_2\big( D(\underline{A}(s), \underline{B}(s)),\ D(\overline{A}(s), \overline{B}(s)) \big) : s \in S \big\} \qquad (20)$$

where $D$ is a divergence between probability distributions and $O_{|S|}$ and
$O_2$ are aggregation operators of arities $|S|$ and 2.

An alternative is based upon the definition of the expected set of a
$\Phi$-probabilistic set. Given a $\Phi$-probabilistic set $A$, we can average
over $\Omega$, obtaining its lower and upper mean-value membership functions.
The expected set (a $\Phi$-fuzzy set) of $A$ is defined as

    $$E(A)(s) = \big[ \underline{E(A)}(s),\ \overline{E(A)}(s) \big] = \left[ \int_\Omega \underline{A}(s, \omega)\, dP(\omega),\ \int_\Omega \overline{A}(s, \omega)\, dP(\omega) \right] \qquad (21)$$

Given two $\Phi$-probabilistic sets $A$ and $B$, we can demote them to their
expected $\Phi$-fuzzy sets and compute

    $$O_{|S|}\big\{ D\big( E(A)(s), E(B)(s) \big) : s \in S \big\} \qquad (22)$$

where $D$ is a dissimilarity between intervals of real numbers.
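
A minimal sketch of the expected-set alternative for a finite $\Omega$ follows. It assumes that each outcome $\omega$ is stored as a mapping from sensor to a (lower, upper) pair, that the interval dissimilarity $D$ is the maximum endpoint difference, and that the aggregation operator $O_{|S|}$ is the arithmetic mean; these three choices and all names are illustrative, since the text leaves $D$ and $O_{|S|}$ open.

```python
def expected_set(outcomes, probs):
    """Lower and upper mean-value memberships per sensor over a finite Omega."""
    sensors = outcomes[0].keys()
    return {
        s: (sum(p * w[s][0] for p, w in zip(probs, outcomes)),
            sum(p * w[s][1] for p, w in zip(probs, outcomes)))
        for s in sensors
    }


def interval_dissimilarity(I, J):
    """Assumed choice of D: largest endpoint difference between two intervals."""
    return max(abs(I[0] - J[0]), abs(I[1] - J[1]))


def expected_divergence(EA, EB):
    """Assumed choice of O_|S|: arithmetic mean over the sensors."""
    return sum(interval_dissimilarity(EA[s], EB[s]) for s in EA) / len(EA)


# Hypothetical prototype: two equiprobable expert outcomes, two sensors.
prototype = [{"s1": (0.0, 0.5), "s2": (0.5, 1.0)},
             {"s1": (0.5, 1.0), "s2": (0.5, 1.0)}]
# Hypothetical input handshape: a single outcome with probability one.
observed = [{"s1": (0.25, 0.5), "s2": (0.5, 0.75)}]

EA = expected_set(prototype, [0.5, 0.5])
EB = expected_set(observed, [1.0])
print(EA, EB, expected_divergence(EA, EB))
```

Because a $\Phi$-fuzzy set is the degenerate case with a single outcome, the same routine compares a prototype with an input pattern.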
Abstract. The author presents a novel feature of 2D and 3D images invariant
to similarity transformations and robust to noise on the basis of the bispectrum.
The invariant feature is applied to the classification of texture images suffering
from rotation, scaling and noise. Computer experiments show that a correct
classification ratio of about 90 % is obtained for 5 kinds of 2D natural textures
and of 3D brain images rotated by an arbitrary angle, scaled up to double and with
white Gaussian noise of 0 dB SNR. The feature can also be used for the
estimation of the rotation angles of texture images.



1. Introduction 

Texture classification invariant to geometric transformations is of importance in many 
practical applications [1]. Several methods for the classification invariant to the 
rotation and scaling of images have been proposed [2]-[15]. However, the robustness 
to noise has not been considered except for [8], [12]. On the other hand, the third- 
order correlation and the bispectrum are robust to the additive noise of any 
symmetrical distribution [16], [17]. Some methods based on the third-order statistics 
have been applied to noisy texture classification [18], [19] and texture synthesis [20]. 
The results on the classification of texture images under rotation, scaling and noise 
using the third-order statistics have not been obtained, however, as far as the author 
knows. 

In this study the author presents a bispectrum-based feature of 2D and 3D patterns 
invariant to similarity transformations and robust to noise. The invariant feature is 
then applied to the classification of 2D and 3D texture images suffering from rotation, 
scaling and additive noise. It is also shown that the invariant feature is applicable to 
the estimation of the rotation angles of texture images. 



2. Invariant Feature Based on the Bispectrum 

The derivation of the invariant feature from the bispectrum and the effective 
calculation method are as follows. 



F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 787-795, 2000. 
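Let $f(\mathbf{x})$ be 2D or 3D image data and $F(\boldsymbol{\omega})$ be its
Fourier transform.

    $$F(\boldsymbol{\omega}) = \int f(\mathbf{x}) \exp(-j \boldsymbol{\omega} \cdot \mathbf{x})\, d\mathbf{x} \qquad (1)$$

The bispectrum of $f(\mathbf{x})$ is defined by the triple product of
$F(\boldsymbol{\omega})$

    $$B(\boldsymbol{\omega}_1, \boldsymbol{\omega}_2) = F(\boldsymbol{\omega}_1) F(\boldsymbol{\omega}_2) F(-\boldsymbol{\omega}_1 - \boldsymbol{\omega}_2) \qquad (2)$$

which is invariant to the shift of image data.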
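
The shift invariance of the triple product $F(\omega_1) F(\omega_2) F(-\omega_1 - \omega_2)$ can be checked numerically. The sketch below is a 1D discrete analogue with circular shifts, using a direct DFT from the standard library only; the 1D setting, the test signal and the function names are assumptions chosen to keep the example short, not the 2D/3D procedure of the paper.

```python
import cmath


def dft(signal):
    """Direct discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def bispectrum(signal):
    """Triple product F(k1) F(k2) F(-k1-k2) on the full discrete grid."""
    n = len(signal)
    F = dft(signal)
    return [[F[k1] * F[k2] * F[(-k1 - k2) % n] for k2 in range(n)]
            for k1 in range(n)]


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 1.0, 2.0]
shifted = signal[-3:] + signal[:-3]  # circular shift by three samples

B0 = bispectrum(signal)
B1 = bispectrum(shifted)
n = len(signal)
max_diff = max(abs(B0[a][b] - B1[a][b]) for a in range(n) for b in range(n))
print("largest bispectrum difference after the shift:", max_diff)
```

A shift multiplies $F(k)$ by a linear phase term; the three phases in the triple product cancel because the three frequencies sum to zero, so only rounding error remains.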

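When image data is rotated about an arbitrary point, the bispectrum is rotated
about the origin by the same angle. When image data is scaled, the bispectrum is
scaled about the origin by the inversely proportional amount. We then integrate
the bispectrum in the $(\boldsymbol{\omega}_1, \boldsymbol{\omega}_2)$ space on
condition that $|\boldsymbol{\omega}_1| / |\boldsymbol{\omega}_2| = r$,
$\boldsymbol{\omega}_1 \cdot \boldsymbol{\omega}_2 / (|\boldsymbol{\omega}_1| |\boldsymbol{\omega}_2|) = \cos\theta$.

    $$I'(r,\theta) = \iint_{r,\theta} B(\boldsymbol{\omega}_1, \boldsymbol{\omega}_2)\, d\boldsymbol{\omega}_1 d\boldsymbol{\omega}_2 \quad \big(|\boldsymbol{\omega}_1| / |\boldsymbol{\omega}_2| = r,\ \boldsymbol{\omega}_1 \cdot \boldsymbol{\omega}_2 / (|\boldsymbol{\omega}_1| |\boldsymbol{\omega}_2|) = \cos\theta\big) \qquad (3)$$

This two-dimensional function $I'(r,\theta)$ on the $(r,\theta)$ plane
represents the amount of the sinusoids of frequency components
$\boldsymbol{\omega}_1$, $\boldsymbol{\omega}_2$,
$\boldsymbol{\omega}_1 + \boldsymbol{\omega}_2$ which have the same ratio $r$ of
length and the same angle $\theta$ in image data.

A feature $I(r,\theta)$ of image data invariant to similarity transformations
(shift, rotation and scaling) is obtained through normalization.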

/(r,0)=/’(r,0)/(ll/’\/-,0))'“ (4) 

Note that this feature is also invariant to linear changes in the gray-scale
values of image data.

To avoid computational complexity in calculation and interpolation of the
bispectrum of high-dimensional (2D or 3D) data, a simple and effective method of
calculating the invariant feature is introduced.

Let $f(\mathbf{x})$ ($\mathbf{x} = (x,y)$; $x, y = 0, \ldots, N-1$) be 2D digital
image data of $N \times N$ pixels. (Extension to 3D images is straightforward.)
The Fourier transform $F(\boldsymbol{\omega})$ of image data is calculated with
a $2N \times 2N$ point FFT using the Gaussian window
$\exp(-(\mathbf{x} - \boldsymbol{\mu}) \cdot (\mathbf{x} - \boldsymbol{\mu}) / (2\sigma^2))$
($\boldsymbol{\mu} = (N/2, N/2)$, $\sigma = N/3$) and padding zeros outside of the
$N \times N$ data. For each frequency pair
$(\boldsymbol{\omega}_1, \boldsymbol{\omega}_2)$ (integer components in
$[-N/2, N/2)$), the triple product
$F(\boldsymbol{\omega}_1) F(\boldsymbol{\omega}_2) F(-\boldsymbol{\omega}_1 - \boldsymbol{\omega}_2)$
is calculated. (The value of $F$ at non-integer frequencies is calculated with
bilinear interpolation.) Its value is then added up into appropriately divided
classes of $(r,\theta)$, where
$|\boldsymbol{\omega}_1| / |\boldsymbol{\omega}_2| = r$ and
$\boldsymbol{\omega}_1 \cdot \boldsymbol{\omega}_2 / (|\boldsymbol{\omega}_1| |\boldsymbol{\omega}_2|) = \cos\theta$.
The obtained table (histogram) on $(r,\theta)$ corresponds to $I'(r,\theta)$ and
the normalization leads to an estimate of the invariant feature $I(r,\theta)$.
The calculation is done in $O(N^4)$ time and in $O(N^2)$ space. (For
$d$-dimensional data, the computational time is in $O(N^{2d})$ and the space is
in $O(N^d)$, where $N^d$ is the number of pixels of image data.)



3. Texture Classification Experiments 

The invariant feature is applied to the classification of 2D natural texture images and 
3D brain images under rotation, scaling and additive noise. 
3.1 2D Natural Textures

Five natural texture images: (a) D4 (Pressed cork), (b) D12 (Bark of tree), (c) D15
(Straw), (d) D17 (Herringbone weave) and (e) D84 (Raffia looped to a high pile) are
taken from the Brodatz album [21]. The pictures of the textures are digitized to 8-bit
gray-scale image data of 260x240 pixels and their central parts of 128x128 pixels
are used in the experiment.

Figure 1 shows the invariant features $I(r,\theta)$, where
$\mathrm{Re}(I(r,\theta))$ is plotted since $\mathrm{Im}(I(r,\theta)) \approx 0$
owing to $F(-\boldsymbol{\omega}) = \overline{F(\boldsymbol{\omega})}$. The ratio
$r$ ranges from 1.0 to 10 in multiplicative steps of $10^{0.1}$ and the interior
angle $\theta$ ranges from 0° to 180° in steps of 18°. The graphs of the
invariant features have different shapes.




Fig. 1. Invariant feature $I(r,\theta)$ of 2D textures.

Table 1 shows the RMS distances between the features of the original images and
those of transformed and noisy images. The transformed images are made with
bilinear interpolation of the original images. For three scaling factors (1.0, 1.5, 2.0),
the images are shifted from 0 to 0.5 by 0.1 and are rotated from 0° to 45° by 5°. A
hundred noisy images are made by adding white Gaussian noise $N(0, 40^2)$ (SNR is
about 0 dB) to the original images. All the transformed and noisy images are correctly
classified with the RMS distance to the invariant features of the original images.
Table 1. RMS distance of the invariant feature of transformed 2D texture images.
Each cell lists Max. / Mean / Min. for the column image (a)-(e).

Image  Transform  (a)             (b)             (c)             (d)             (e)
(a)    x1.0       0.23/0.16/0.0   1.73/1.70/1.67  0.98/0.93/0.86  1.50/1.44/1.39  1.25/1.22/1.19
(b)    x1.0       1.74/1.68/1.60  0.57/0.37/0.0   1.77/1.69/1.59  1.22/1.17/1.09  1.54/1.49/1.43
(c)    x1.0       1.00/0.95/0.89  1.80/1.76/1.72  0.51/0.34/0.0   1.44/1.42/1.38  1.53/1.43/1.31
(d)    x1.0       1.44/1.42/1.39  1.12/1.09/1.07  1.45/1.44/1.41  0.23/0.17/0.0   1.73/1.72/1.71
(e)    x1.0       1.33/1.22/1.16  1.53/1.47/1.37  1.46/1.38/1.33  1.76/1.72/1.68  0.51/0.30/0.0
(a)    x1.5       0.31/0.27/0.21  1.73/1.71/1.68  0.89/0.83/0.78  1.54/1.49/1.45  1.20/1.16/1.11
(b)    x1.5       1.80/1.76/1.72  0.65/0.62/0.57  1.84/1.74/1.65  0.98/0.92/0.87  1.61/1.54/1.44
(c)    x1.5       0.91/0.85/0.81  1.81/1.78/1.76  0.68/0.49/0.44  1.47/1.42/1.39  1.58/1.50/1.42
(d)    x1.5       1.48/1.42/1.35  1.15/1.09/1.05  1.43/1.39/1.34  0.32/0.31/0.29  1.75/1.72/1.70
(e)    x1.5       1.39/1.34/1.24  1.46/1.41/1.39  1.45/1.41/1.34  1.65/1.63/1.60  0.64/0.59/0.49
(a)    x2.0       0.52/0.38/0.28  1.74/1.68/1.62  0.86/0.83/0.77  1.51/1.45/1.40  1.24/1.15/1.07
(b)    x2.0       1.93/1.85/1.80  1.02/0.96/0.86  1.85/1.74/1.56  1.18/1.06/0.96  1.72/1.61/1.46
(c)    x2.0       0.89/0.78/0.66  1.81/1.78/1.75  0.64/0.51/0.44  1.49/1.46/1.41  1.46/1.42/1.36
(d)    x2.0       1.47/1.44/1.42  1.17/1.09/1.00  1.48/1.38/1.30  0.49/0.42/0.38  1.83/1.76/1.70
(e)    x2.0       1.55/1.28/1.04  1.53/1.46/1.39  1.65/1.36/1.04  1.77/1.52/1.36  1.00/0.89/0.74
(a)    Noise      0.25/0.17/0.10  1.78/1.71/1.62  0.97/0.90/0.80  1.56/1.43/1.29  1.35/1.25/1.12
(b)    Noise      1.78/1.71/1.64  0.33/0.22/0.13  1.83/1.76/1.68  1.83/1.76/1.68  1.59/1.50/1.40
(c)    Noise      0.98/0.89/0.83  1.81/1.76/1.73  0.24/0.15/0.09  1.49/1.44/1.38  1.48/1.40/1.31
(d)    Noise      1.50/1.44/1.37  1.15/1.09/1.03  1.50/1.44/1.39  0.23/0.14/0.08  1.77/1.73/1.66
(e)    Noise      1.35/1.26/1.11  1.59/1.51/1.40  1.51/1.40/1.26  1.80/1.72/1.62  0.37/0.21/0.11

Next, a classification experiment is carried out on image data with random
transformations and additive noise. The original image data are randomly shifted in
[-40.0, 40.0], rotated in [0°, 360°] and scaled by a factor in [1.0, 2.0]. White Gaussian
noise with standard deviation (S.D.) 40, 80 and 120 (SNR ≈ 0, -6 and -10 dB,
respectively) is added to the transformed image data. For each image, 100
transformed and noisy images are generated and classified by the RMS distance to the
invariant features of the original images.

The correct classification ratios are shown in Table 2. A correct classification ratio
of at least 90% is achieved for the randomly transformed images with noise up to
-6 dB. Note that only the simple RMS distance to the original data is used here. It can
be shown that the performance improves further when optimization and learning
techniques are applied.
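The classification rule described above (assign a transformed image to the original whose invariant feature is nearest in RMS distance) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the representation of a feature as a flat list of (possibly complex) samples, and the absence of any feature normalization are all assumptions.

```python
import math


def rms_distance(u, v):
    """Root-mean-square distance between two equally long feature vectors.

    abs() is used so that complex-valued feature samples are also handled.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)) / len(u))


def classify(feature, references):
    """Return the label of the reference feature nearest to `feature`.

    `references` maps a label such as "a" to the feature of an original image.
    """
    return min(references, key=lambda label: rms_distance(feature, references[label]))


if __name__ == "__main__":
    # Toy reference features (hypothetical values, not taken from the paper).
    refs = {"a": [0.0, 0.0, 0.0], "b": [1.0, 1.0, 1.0]}
    print(classify([0.9, 1.2, 0.8], refs))
```

The correct classification ratio is then the fraction of transformed images whose returned label matches the image they were generated from.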



Table 2. Correct classification ratio for 2D texture images.

S.D. of noise (SNR (dB))            0.0 (∞)   40.0 (0)   80.0 (-6)   120.0 (-9.5)
Correct classification ratio (%)    100       94         90          67
3.2 3D Brain Images

The Normal Brain Database in the BrainWeb at McGill University [22] is used for 3D 
texture images. The database consists of the realistic MR images of the brain of 
181x217x181 voxels (1 mm size), which are generated using the MRI simulator. Five 
3D image data of 16x16x16 voxels are taken from the brain image: (a) right anterior 
region, (b) right posterior region, (c) left anterior region, (d) left posterior region, (e) 
central region. 




Fig. 2. Invariant feature I(r, θ) of 3D brain images



The invariant features I(r, θ) of the original images are shown in Fig. 2. Table 3
shows the RMS distances between the features of the original images and those of the
transformed and noisy images. The images are shifted from 0 to 0.5 in steps of 0.1
along the x-axis and rotated from 0° to 45° in steps of 5° about the x-axis and in steps
of 15° about the y- and z-axes for each scaling factor. The distances for 100 images
with white Gaussian noise N(0, 25^2) (SNR of about 0 dB) are also shown. Several
transformed images magnified by a factor of two are not correctly classified with the
RMS distance. Scaling produces larger variations in the invariant features than in the
2D case because the 3D images are small.

Table 4 shows the correct classification ratios for randomly transformed and noisy
images. The original image data are randomly shifted in [-0.5, 0.5] along the x-, y- and
z-axes and rotated in [0°, 360°] about the three axes, with or without scaling by a
factor in [1.0, 2.0] and with or without added white Gaussian noise N(0, 25^2). For
each image, 100 images are generated and classified by the RMS distance of the
invariant features. A correct classification ratio of more than 80% is achieved for
additive noise up to 0 dB.



Table 3. RMS distance of the invariant feature of transformed 3D brain images.

Rows: transformed images; columns: original images (a)-(e).

Image  Scale    (a)               (b)               (c)               (d)               (e)
                Max. Mean Min.    Max. Mean Min.    Max. Mean Min.    Max. Mean Min.    Max. Mean Min.
(a)    ×1.0     0.24 0.13 0.00    1.97 1.93 1.89    1.07 0.90 0.76    1.66 1.55 1.47    0.82 0.71 0.62
(b)    ×1.0     1.99 1.97 1.95    0.47 0.20 0.00    1.77 1.64 1.55    1.29 1.11 1.00    1.96 1.94 1.89
(c)    ×1.0     1.20 1.02 0.86    1.60 1.49 1.36    0.36 0.16 0.00    0.99 0.83 0.67    1.45 1.34 1.24
(d)    ×1.0     1.73 1.63 1.57    1.02 0.94 0.76    1.10 0.96 0.89    0.31 0.10 0.00    1.74 1.65 1.60
(e)    ×1.0     0.75 0.63 0.47    1.99 1.97 1.96    1.44 1.29 1.20    1.74 1.64 1.60    0.34 0.17 0.00
(a)    ×1.5     0.62 0.35 0.09    1.99 1.98 1.95    1.43 1.21 0.92    1.86 1.74 1.54    0.66 0.60 0.57
(b)    ×1.5     1.99 1.95 1.92    0.59 0.40 0.33    1.76 1.66 1.61    1.45 1.24 1.13    1.97 1.95 1.92
(c)    ×1.5     1.67 0.98 0.61    1.73 1.49 0.78    0.98 0.45 0.25    1.24 0.92 0.52    1.81 1.30 1.07
(d)    ×1.5     1.78 1.58 1.42    1.18 0.97 0.64    1.17 0.90 0.69    0.43 0.24 0.11    1.80 1.63 1.52
(e)    ×1.5     0.80 0.65 0.25    1.98 1.94 1.79    1.55 1.28 0.80    1.89 1.65 1.16    0.75 0.43 0.26
(a)    ×2.0     1.67 0.92 0.58    1.98 1.84 1.02    1.76 1.53 0.89    1.95 1.77 0.47    1.64 0.78 0.57
(b)    ×2.0     1.96 1.51 0.48    1.83 0.98 0.32    1.57 1.16 0.51    1.40 1.11 0.64    1.95 1.73 0.95
(c)    ×2.0     1.90 0.85 0.19    1.97 1.52 0.43    1.45 1.15 0.74    1.82 1.36 0.64    1.87 1.07 0.59
(d)    ×2.0     1.83 0.76 0.33    1.95 1.64 0.53    1.44 1.16 0.50    1.87 1.42 0.17    1.85 1.05 0.73
(e)    ×2.0     1.80 1.27 0.15    1.95 1.31 0.69    1.85 1.31 0.60    1.85 1.21 0.12    1.80 1.24 0.57
(a)    Noise    0.20 0.10 0.04    1.96 1.94 1.89    1.06 0.93 0.77    1.66 1.57 1.46    0.80 0.70 0.60
(b)    Noise    1.97 1.95 1.87    0.27 0.14 0.08    1.66 1.56 1.38    1.21 1.02 0.87    1.97 1.95 1.92
(c)    Noise    1.11 0.92 0.72    1.70 1.56 1.42    0.27 0.15 0.07    1.09 0.92 0.73    1.40 1.27 1.10
(d)    Noise    1.61 1.59 1.55    1.05 1.00 0.96    0.94 0.91 0.86    0.07 0.04 0.02    1.64 1.61 1.58
(e)    Noise    0.94 0.72 0.52    1.97 1.95 1.89    1.45 1.29 1.14    1.72 1.60 1.50    0.33 0.15 0.08



Table 4. Correct classification ratio for 3D brain images.

S.D. of noise (SNR (dB))   Scaling factor   Correct classification ratio (%)
0.0 (∞)                    ×1.0             98
25.0 (0)                   ×1.0             99
0.0 (∞)                    ×1.0 to ×2.0     87
25.0 (0)                   ×1.0 to ×2.0     86



4. Estimation of Rotation Angle 



The invariant feature is applied to the estimation of the rotation angles of 2D texture
images.

First, we take a complex-valued function f_c(x, y) obtained from the gradient field of
the image:

    f_c(x, y) = f_r(x, y) + i f_i(x, y),
    f_r(x, y) = f(x, y) + f(x, y+1) - f(x+1, y) - f(x+1, y+1),
    f_i(x, y) = f(x, y) - f(x, y+1) + f(x+1, y) - f(x+1, y+1).                  (5)

For the image rotated by \theta_0, the phase of f_c changes by \theta_0:

    f_c(x, y; \theta_0) = f_c(x, y; 0) \exp(i\theta_0)

It can be shown that the phases of the bispectrum and of the invariant feature I(r, \theta)
also change by \theta_0. The LMS estimate \hat{\theta}_0 of the rotation angle \theta_0 of the
image is given by

    \hat{\theta}_0 = \tan^{-1} \frac{\sum ( -\mathrm{Re}\,I(\theta_0)\,\mathrm{Im}\,I(0) + \mathrm{Im}\,I(\theta_0)\,\mathrm{Re}\,I(0) )}
                                    {\sum ( \mathrm{Re}\,I(\theta_0)\,\mathrm{Re}\,I(0) + \mathrm{Im}\,I(\theta_0)\,\mathrm{Im}\,I(0) )}      (6)

    ( \sum | I(r, \theta; \theta_0) - I(r, \theta; 0) \exp(i\hat{\theta}_0) |^2 \to \min. )

Figure 3 shows the LMS estimates of the rotation angles of the 2D texture images
used in Section 3.1: (a) without noise, (b) with N(0, 20^2), (c) with N(0, 40^2) and (d)
scaled by 1.5. The mean and standard deviation of the estimation errors are shown in
Table 5. The standard deviation of the estimation errors is less than 10° for noise
levels down to an SNR of 6 dB and for scaling up to 1.5.




Fig. 3. Estimation of rotation angle of 2D texture images.
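The least-squares phase estimate of this section can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the features are passed as flat lists of complex samples, the function name is hypothetical, and `atan2` is used in place of a plain arctangent of the ratio so that the quadrant of the angle is resolved.

```python
import cmath
import math


def estimate_rotation(feat_rotated, feat_original):
    """Least-squares estimate of the phase shift between two complex features.

    Minimizes sum |A - B * exp(i * theta)|^2 over theta, where A are the samples
    of the rotated image's feature and B those of the original image's feature.
    The minimizer is the argument of sum(A * conj(B)).
    """
    num = sum(-a.real * b.imag + a.imag * b.real
              for a, b in zip(feat_rotated, feat_original))
    den = sum(a.real * b.real + a.imag * b.imag
              for a, b in zip(feat_rotated, feat_original))
    return math.atan2(num, den)  # radians, in (-pi, pi]


if __name__ == "__main__":
    # Hypothetical feature samples; the rotated feature is a pure phase shift.
    original = [1 + 2j, -0.5 + 0.3j, 2 - 1j, 0.1 + 0.9j]
    true_angle = math.radians(40.0)
    rotated = [b * cmath.exp(1j * true_angle) for b in original]
    print(math.degrees(estimate_rotation(rotated, original)))
```

With noise-free input the numerator and denominator are proportional to the sine and cosine of the true angle, so the estimate recovers it up to rounding.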
Table 5. Mean and S.D. of the estimation errors of rotation angles.

      S.D. of noise (SNR (dB))   Scaling factor   Estimation error (degree)
                                                  Mean      S.D.
(a)   0.0 (∞)                    ×1.0             1.3       7.1
(b)   20.0 (6)                   ×1.0             -0.2      7.7
(c)   40.0 (0)                   ×1.0             0.96      23.2
(d)   0.0 (∞)                    ×1.5             3.3       8.9



5. Discussion 

The invariant feature based on the bispectrum is applied to the classification of 2D
and 3D texture images subjected to rotation, scaling and additive noise. High
performance is obtained for arbitrary rotation, scaling up to a factor of two and
additive noise up to 0 dB. The feature is also invariant to linear changes in the
gray-scale values of the image data.

The RMS distance between the invariant features of the transformed images and
those of the original ones is used, and no a priori knowledge of the transformations is
assumed in the classification experiment. The use of optimization techniques can
increase the classification performance and make it possible to classify similar texture
images whose features are separated by a smaller RMS distance.

The advantage of features based on third-order statistics is their robustness to
noise. The invariant feature is effective even for additive noise at an SNR below 0 dB,
where most approaches based on second-order statistics are not applicable.

Computational complexity in the calculation of the bispectrum is avoided to some 
extent by using the voting method. The computational time is in 0{N^) and the space 
is in 0{N‘), where N‘‘ is the number of pixels of image data {d\ dimension). The 
calculation for one image data is done in several minutes with the SUN workstation in 
the experiment, which is in a range of practical use. 
Abstract. We aim to determine which of a set of competing models is statistically
best, that is, best on average. One way to define “on average” is to consider the
performance of these algorithms averaged over all the training sets that might be
drawn from the underlying distribution. When comparing more than two means,
an ANOVA F-test tells you whether the means are significantly different, but it
does not tell you which means differ from each other. A simple approach is to
test each possible difference by a paired t-test. However, the probability of
making at least one type I error increases with the number of tests made. Multiple
comparison procedures provide different solutions. We discuss these techniques
and apply the well-known Bonferroni method in order to determine the
optimal degree in polynomial fitting and the optimal number of hidden neurons
in feedforward neural networks.



1 Introduction 

We consider the general problem of determining which of a set of competing models
is best. Although there is active debate within the research community regarding
the exact meaning of "best", statistical approaches are reasonable. A statistical
approach to model selection tries to find which model is better on average. One way
to define “on average” is to consider the performance of a given algorithm averaged
over all the training sets that might be drawn from the underlying distribution. In a
real situation, the underlying distribution is unknown, and we only have a finite-size
sample to work with.

In the following sections, we first describe the design of a randomized data
collection procedure required to control the different sources of variation. This design
allows us to generate several training sets following the underlying distribution,
taking into account the different sources of variation that could exist. After
collecting the data, our goal is to make inferences about k population means.
Although the ANOVA test allows us to reject the null hypothesis that the groups’ means
are all equal, it does not pinpoint where the significant differences lie. Multiple t-tests
are not appropriate because the probability of a type I error increases with the number
of comparisons made. Statistical methods to compare three or more means while
controlling the probability of making at least one type I error are called multiple
comparison procedures. We briefly discuss these methods, including Fisher's LSD,
Tukey's HSD, Bonferroni, Newman-Keuls, Duncan and Scheffé procedures, and
comment on their potential advantages.

We show how these techniques can be applied to model selection through two
examples. First, the model selection strategy is applied to determine the optimal
degree in polynomial fitting. The results show that the optimal degree obtained is,
in fact, the degree of the polynomial from which the data are generated. Second, the
same procedure is applied to determine the number of hidden neurons in feedforward
networks. Obviously, in this case, we cannot validate the results.



2 Design of the Experiment 

In order to compare different models, we must guarantee the independence of the
results by controlling the sources of variation which affect the behaviour of the
models. Dietterich has analysed the sources of variation which a good statistical test
should control. These sources of variation are controlled as follows:

• Variation resulting from the choice of the training and test data sets. On any
particular randomly drawn pair of training and test data sets, one model may
outperform another. Given that we are studying how the models behave on average,
we should repeat the estimation of the error over different training and test sets, and
determine whether any mean error is significantly smaller than the others. In order to
compare different means, we recommend at least 30 measures to reduce the
standard error of the comparisons.

• Variation resulting from the size of the test and training data sets. The performance
of two different models changes smoothly with changes in the size of the
training set. If a large amount of data is available, it is possible to set some of it
aside to serve as a test set for evaluating the performance of the treatment. However,
in most situations the amount of data is limited and all of it must be used as the
input set. Cross-validation and bootstrap procedures are the most common forms of
resampling. However, resampling means that each pair of training sets shares a
high proportion of the samples. This problem of overlapping can be solved
by using two-fold cross-validation, which involves partitioning the data set
into two disjoint sets, training and test sets, of the same size.

• Internal randomness in the estimation of the model parameters. If the estimation of
the parameters is analytical and its result is unique, this step can be omitted
because there is no internal randomness. However, in an iterative approach the
results depend critically on the starting state. Most iterative procedures suffer
from internal randomness due to the initialisation of the parameter set to small
random values. This parameter set depends on the model complexity, so it is
different in value and number for each model. Hence, to control this source of
variation, several starting states are taken for each training data set. We focus our
study on the average behaviour of the model, so the extreme cases (the minimum and
the maximum error estimates) are excluded and the mean error of the remaining
results is considered to be the actual error of the model.

The complete strategy repeats a similar process 30 times: a random split of the data
into a pair of equal-sized portions, followed by two-fold cross-validation for the
estimation of the error of each model. The whole process is summarized as follows:

for v:=1 to 30
    shuffle(Data)                           // random split of Data
    (S1,S2) := Partition(Data)
    for k:=1 to M                           // M = number of competing models
        for fold:=1 to 2                    // two-fold cross-validation
            for i:=1 to 10                  // when internal randomness exists
                W := ParameterEstimate(S1)
                PError(i) := ErrorEstimate(W, S2)
            end
            Error(fold) := RobustMean(PError)
            Swap(S1, S2)
        end
        ModelError(k, v) := Mean(Error)
    end
end
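The procedure above can be sketched in Python as follows. This is a minimal illustration under assumptions that are not in the paper: each model is represented by a hypothetical pair of callables `(fit, error)`, where `fit(train, rng)` returns fitted parameters (using `rng` for any random initialisation) and `error(params, test)` returns the error on the held-out half; the robust mean simply drops the single smallest and largest estimate.

```python
import random
from statistics import mean


def robust_mean(values):
    """Mean after discarding the minimum and the maximum estimate."""
    if len(values) <= 2:
        return mean(values)
    return mean(sorted(values)[1:-1])


def model_errors(data, models, n_repeats=30, n_starts=10, seed=0):
    """Return errors[k][v]: error of model k on the v-th random split.

    Each repetition shuffles the data, splits it into two equal halves and
    runs two-fold cross-validation with n_starts restarts per fold.
    """
    rng = random.Random(seed)
    errors = [[0.0] * n_repeats for _ in models]
    for v in range(n_repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        s1, s2 = shuffled[:half], shuffled[half:]
        for k, (fit, error) in enumerate(models):
            fold_errors = []
            train, test = s1, s2
            for _ in range(2):                      # two-fold cross-validation
                runs = [error(fit(train, rng), test) for _ in range(n_starts)]
                fold_errors.append(robust_mean(runs))
                train, test = test, train           # swap the two halves
            errors[k][v] = mean(fold_errors)
    return errors


if __name__ == "__main__":
    # Toy example: compare "predict the training mean" with "predict zero".
    data = [(x, 2.0 + 0.1 * ((x * 7) % 5 - 2)) for x in range(40)]

    def rmse(p, test):
        return mean((y - p) ** 2 for _, y in test) ** 0.5

    mean_model = (lambda train, rng: mean(y for _, y in train), rmse)
    zero_model = (lambda train, rng: 0.0, rmse)
    errs = model_errors(data, [mean_model, zero_model])
    print(mean(errs[0]), mean(errs[1]))
```

Each row of the returned matrix holds the 30 error measures of one model, which is the input expected by the tests discussed in the following sections.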



3 Testing for Differences among Means in Groups 

Once we have obtained a set of error measures for each model that controls all the 
possible sources of variation of the experiment, we should compare them. First, we 
consider the problem of determining whether the means of error measures can be 
statistically considered equal or different. We study the assumptions that should be 
verified in order to make any valid inference. Second, we consider a more difficult 
problem: given that we know that error means are not equal, which of them is signifi- 
cantly smaller than the others? 



3.1 Are the Means Equal? 

As a first step, we may consider the use of a t-test to assess the equality of the means
of two populations. However, if we are interested in testing whether the means of more
than two populations differ significantly, we must use a procedure called the
analysis of variance (ANOVA)[7].

Analysis of variance is a parametric technique that tests the null hypothesis that the
population means are equal to each other. However, in order to draw conclusions
about population means, several assumptions should be taken into account:

• All k population probability distributions should be normal. While this assumption
is not critical with large sample sizes, it is important with small sample sizes
(especially with unequal sample sizes). This assumption has been tested using the
Kolmogorov-Smirnov method, and we have always found that the distribution
of the results follows a Gaussian curve.

• The k population variances should be equal. This assumption is not critical
when all the models have the same (or almost the same) number of error measures,
but it is very important when this number differs. In our method the number of
error measures is the same for all the models.

• The samples from each population should be random and independent. This
assumption depends strongly on the design of the experiment. As the sources of
variation have been taken into account, we assume random and independent data
samples. Strictly speaking, the independence of the samples is not guaranteed in our
design, given that the different results have been obtained by randomly splitting the
finite-sized available data. However, for pairwise comparisons, the violation of this
assumption is of secondary importance.

When the assumptions for analyzing data collected from a completely randomized
design are violated, any inferences derived from the ANOVA are suspect. An alternative
technique to use in this situation is the nonparametric Kruskal-Wallis test.
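As an illustration of the test discussed in this section, the one-way ANOVA F statistic can be computed directly from the groups of error measures. The sketch below uses only the standard library; the function name is hypothetical, and the resulting F value still has to be compared with a critical value of the F distribution (from tables or a statistics package) at the chosen significance level.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within) for a list of groups.

    F is the ratio of the between-group mean square to the within-group
    mean square. Raises ZeroDivisionError if all groups have zero variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Three small hypothetical groups of error measures.
    print(anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

In the setting of this paper each group would hold the 30 error measures of one competing model.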



3.2 Which Means Are Equal? 

When comparing more than two means, an ANOVA F-test tells you whether the
means are significantly different, but it does not tell you which means differ from each
other. The first idea that comes to mind is to test each possible difference by a paired
t-test. However, with this approach the probability of making at least one type I
error increases with the number of tests made. Statistical methods to compare three or
more means while controlling the probability of making at least one type I error are
called multiple comparison procedures.



4 Multiple Comparison Procedures 

Multiple comparison procedures compare the average effects of three or more
treatments to decide which treatments are better, which ones are worse, and by how
much, while controlling the probability of making an incorrect decision. A wide range
of multiple comparison procedures is available in the literature.

Fisher’s Least Significant Difference (LSD) procedure begins with a one-way
analysis of variance. Only when the overall F-ratio is statistically significant do we
carry out all possible t-tests. Some authors refer to this procedure as Fisher’s Protected
LSD to emphasize the protection provided by the F-ratio.

Tukey’s Honestly Significant Difference (HSD) follows the path of Student,
determining the distribution of the largest t statistic when many groups are compared and
there are no underlying differences. It is a test specifically designed for pairwise
comparisons when the sample sizes are equal. Tukey and Kramer independently proposed
a modification for unequal cell sizes. Two means are considered significantly different
by the Tukey-Kramer criterion if

    |t_{ij}| > q(\alpha; k, \nu)                                  (1)

where q(\alpha; k, \nu) is the \alpha-level critical value of a studentized range distribution
of k independent normal random variables with \nu degrees of freedom.

Bonferroni[2] is a well-known and easy-to-apply follow-up analysis of the ANOVA
F-test. This procedure adjusts the observed significance level based on the number of
comparisons we are making. The technique compares the difference between two
treatment means to a critical difference. This critical difference depends on the number
of observations in each treatment, the significance level, the variability unexplained by
the differences between the sample means, and the total number of treatments to be
compared. If the difference between the sample means exceeds the critical difference,
there is sufficient evidence to conclude that the population means differ. The Bonferroni
t-test declares two means to be significantly different if

    |t_{ij}| > t(\epsilon; \nu)                                   (2)

where

    \epsilon = \frac{2\alpha}{k(k-1)}                             (3)

for comparisons of k means.

The Student-Newman-Keuls (SNK) procedure is an attempt to compromise between
LSD and HSD. Like the Tukey HSD, it is based on a studentized range distribution.
This procedure is more powerful than the Tukey HSD and is better at controlling
the experimentwise error rate (EER) than the LSD. However, it is less often used,
mainly for two reasons. First, it cannot be used to construct confidence intervals for
differences between means. Second, there are patterns of population means which lead
to an inflated EER.

Duncan’s method looks much like the SNK procedure but declares many more
differences significant. It is only very slightly more conservative than Fisher’s LSD,
and, in practice, in the majority of cases the two lead to the same conclusions.

A technique slightly less conservative than Bonferroni is the Sidak test given by

    |t_{ij}| > t(\epsilon; \nu)                                   (4)

where

    \epsilon = 1 - (1 - \alpha)^{2/(k(k-1))}                      (5)

for comparisons of k means.
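The per-comparison levels used by the Bonferroni and Sidak tests follow directly from the number of pairwise comparisons, m = k(k-1)/2. The short sketch below evaluates both; the function names are hypothetical, and the helper `familywise_error` additionally assumes that the m tests are independent, which is an idealization rather than a property of the experiment described here.

```python
def n_pairs(k):
    """Number of pairwise comparisons among k means."""
    return k * (k - 1) // 2


def bonferroni_level(alpha, k):
    """Per-comparison level epsilon = 2 * alpha / (k * (k - 1))."""
    return alpha / n_pairs(k)


def sidak_level(alpha, k):
    """Per-comparison level epsilon = 1 - (1 - alpha) ** (2 / (k * (k - 1)))."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_pairs(k))


def familywise_error(level, m):
    """Probability of at least one type I error in m independent tests."""
    return 1.0 - (1.0 - level) ** m


if __name__ == "__main__":
    alpha, k = 0.1, 10          # ten competing models, overall level 0.1
    print(bonferroni_level(alpha, k), sidak_level(alpha, k))
    # Uncorrected pairwise tests at level alpha inflate the overall error rate:
    print(familywise_error(alpha, n_pairs(k)))
```

For the same overall level the Sidak value is always slightly larger than the Bonferroni value, which is why the Sidak test is the less conservative of the two.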
Scheffé proposes another method to control the maximum error rate under any
complete or partial null hypothesis. Two means are declared significantly different if

    |t_{ij}| > \sqrt{(k-1) F(\alpha; k-1, \nu)}                   (6)

where F(\alpha; k-1, \nu) is the \alpha-level critical value of an F distribution with k-1
numerator degrees of freedom and \nu denominator degrees of freedom. Scheffé's test
never declares a contrast significant if the overall F-test is nonsignificant.

Scheffé's method may be more powerful than the Bonferroni or Sidak methods if the
number of comparisons is large relative to the number of means. The Tukey-Kramer
method is more powerful than the Bonferroni, Sidak or Scheffé methods for pairwise
comparisons.

In conclusion, we maintain that there is no single “correct” procedure to use. The
various procedures trade off power for control of the EER in different ways. Most
researchers believe that Duncan’s and Fisher’s LSD procedures result in too high an
EER and should not be used. If you want to be sure that you have controlled the EER,
then the Tukey HSD should be used, at the expense of lower power. In practice, it is
advisable to avoid conducting multiple comparisons of a small number of treatment
means when the corresponding ANOVA F-test is nonsignificant; otherwise, confusing
and contradictory results may occur. Finally, we should remember that failure to reject
the hypothesis that two or more means are equal should not lead us to conclude that the
population means are, in fact, equal. Failure to reject the null hypothesis implies only
that the differences between the population means, if any, are not large enough to be
detected with the given sample size.



5 Simulation Results 

In this section we provide two examples of model order selection using the
Bonferroni multiple comparison procedure. Given a model selection problem, we
proceed as follows:

1. Select an error criterion.

2. Generate 30 error values for each model as specified in Section 2.

3. Select the desired overall significance level: \alpha = 0.1.

4. Use the ANOVA F-test to determine whether the mean errors are significantly
different from each other.

5. For each model, determine the set of models that are not significantly different
from it by the Bonferroni method.

6. If the groups do not overlap, take the group containing the model with the least
error and select the simplest model in that group. Otherwise, select the model with
the least error.



5.1 Determining the Degree of Polynomial Fitting 

Let us consider the problem of finding the degree N of a polynomial P(x) that best
fits a set of data in the least-squares sense. Figure 1 shows the experimental curve and a
set of 160 data points generated by adding Gaussian noise, which will be used in the
experiment. 80 data points are used to determine the coefficients, and 80 are
used to calculate the RMS error. The only aspect of the polynomials which remains to
be specified is the degree N, and so we use a set of polynomials with degree ranging
from 1 to 10.




Figure 1: Experimental curve and data points for polynomial fitting. The experimental
polynomial is P(x) = 0.4x^3 - 0.5x^2 - 0.25x, x ∈ [-1, 3].

As explained above, 30 RMS errors have been generated for each polynomial.
We used the ANOVA F-test to determine whether the mean RMSEs are significantly
different from each other, and the Bonferroni method to determine whether the observed
differences in the sample means imply that differences exist among the accuracies of
the competing polynomials. The overall significance level is fixed at 0.1.

Table 1 shows the results obtained in this case. The table lists the polynomial
degree, the RMSE and the set of polynomial degrees that are not significantly different.
Two polynomials are not significantly different if the difference between their means is
less than the critical value, computed as 0.02256. In this case, there are three groups.
Polynomials of degree 3 to 10 form a group whose RMSEs are not significantly
different, and the polynomial of degree 3 is selected (Occam’s Razor criterion [1]).

Table 1: Simulation results (160 data points)

Polynomial degree   RMSE      Polynomial degrees not significantly different
3                   0.04261   3 4 5 6 7 8 9 10
4                   0.04340   3 4 5 6 7 8 9 10
5                   0.04406   3 4 5 6 7 8 9 10
6                   0.04519   3 4 5 6 7 8 9 10
7                   0.04543   3 4 5 6 7 8 9 10
8                   0.04655   3 4 5 6 7 8 9 10
9                   0.04777   3 4 5 6 7 8 9 10
10                  0.04903   3 4 5 6 7 8 9 10
2                   0.18750   2
1                   0.50280   1
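The grouping and selection rule (steps 5 and 6 of the procedure) can be sketched from the mean errors and the critical difference alone. The sketch below is an illustration, not the authors' implementation: it takes the critical difference as given rather than deriving it from the Bonferroni test, it identifies each model by its complexity (polynomial degree or number of hidden units), and it treats the groups as overlapping whenever two distinct groups share a model.

```python
def groups_not_different(mean_error, critical_diff):
    """Map each model to the set of models whose mean error differs from its
    own by less than the critical difference."""
    return {
        m: frozenset(o for o, e in mean_error.items()
                     if abs(e - mean_error[m]) < critical_diff)
        for m in mean_error
    }


def select_model(mean_error, critical_diff):
    """Select a model; keys of mean_error are model complexities."""
    groups = groups_not_different(mean_error, critical_diff)
    best = min(mean_error, key=mean_error.get)
    distinct = set(groups.values())
    overlapped = any(a != b and a & b for a in distinct for b in distinct)
    if overlapped:
        return best                 # groups overlap: least mean error wins
    return min(groups[best])        # simplest model in the best group


if __name__ == "__main__":
    # Mean RMSE per polynomial degree and critical difference from Table 1.
    table1 = {3: 0.04261, 4: 0.04340, 5: 0.04406, 6: 0.04519, 7: 0.04543,
              8: 0.04655, 9: 0.04777, 10: 0.04903, 2: 0.18750, 1: 0.50280}
    print(select_model(table1, 0.02256))
```

Applied to the Table 1 values, the rule forms the groups {3, ..., 10}, {2} and {1} and returns degree 3, in agreement with the selection reported in the text.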
Table 2 shows the results when the number of data points is 40. Two polynomials are
not significantly different if the difference between their means is less than the critical
value, computed as 2.75873. In this case the groups overlap. Because the variation
among the RMSE means is not significant, the polynomial degree with the least RMSE
mean is selected. The model with degree 3 is selected.



Table 2: Simulation results (40 data points)

Polynomial degree   RMSE      Polynomial degrees not significantly different
3                   0.06426   3 4 5 6 7 2 8 1 9
4                   0.07468   3 4 5 6 7 2 8 1 9
5                   0.10979   3 4 5 6 7 2 8 1 9
6                   0.11570   3 4 5 6 7 2 8 1 9
7                   0.15173   3 4 5 6 7 2 8 1 9
2                   0.28682   3 4 5 6 7 2 8 1 9
8                   0.45635   3 4 5 6 7 2 8 1 9
1                   0.78130   3 4 5 6 7 2 8 1 9 10
9                   0.97943   3 4 5 6 7 2 8 1 9 10
10                  3.32416   1 9 10



5.2 Determining the Number of Hidden Neurons in Multilayer Perceptrons

Let us now consider the problem of determining the number of hidden units in a
feed-forward neural network in a classification task. Let us define a data set where
each input vector has been labelled as belonging to one of two classes, C_1 and C_2.
Figure 2 shows the input patterns. In the simulation study, we consider multi-layer
perceptrons having two layers of weights with full connectivity between adjacent layers,
one linear output unit, M hidden units and no direct input-output connections. The
only aspect of the architecture that remains to be specified is the number M of hidden
units, and so we train a set of networks (models) having a range of values of M.



Figure 2. Sample data distribution. The sample size is N_1 = 280 data of the class C_1
and N_2 = 140 of the class C_2.
Table 3 shows the simulation results in this case. Two models are in the same group
if the difference between their means is less than the critical value, 0.02212. Thus, from
the group of models with the lowest error means (4 to 10 hidden units), the model with
4 hidden units is selected by Occam’s Razor criterion. If the number of models to be
compared increases, the results show that four hidden units is still a good selection;
that is, there is no statistically significant difference among the error means of neural
network architectures with four or more hidden units.



Table 3. Simulation Results (280 data points)

Hidden units   Error mean   Models not significantly different
7              0.13790      7 5 8 6 10 9 4
5              0.13995      7 5 8 6 10 9 4
8              0.13995      7 5 8 6 10 9 4
6              0.14033      7 5 8 6 10 9 4
10             0.14214      7 5 8 6 10 9 4
9              0.14319      7 5 8 6 10 9 4
4              0.14900      7 5 8 6 10 9 4
3              0.18848      3
2              0.31433      2
1              0.35938      1



Table 4 shows the results when the number of data points is 60. In this case two
models are in the same group if the difference between their means is less than 0.08818.
We can see that the groups overlap. This may be due to two main reasons:
either we do not have enough data points or the training has been stopped too soon.
Because the variation among the misclassification error means is not significant, the
model with the least error, 5 hidden units, is selected.



Table 4: Simulation results (60 data points)

Hidden units   Error mean   Models not significantly different
5              0.04044      5 3 4 6 7 9 10 8 2 1
3              0.04222      5 3 4 6 7 9 10 8 2 1
4              0.04222      5 3 4 6 7 9 10 8 2 1
6              0.04622      5 3 4 6 7 9 10 8 2 1
7              0.04778      5 3 4 6 7 9 10 8 2 1
9              0.04822      5 3 4 6 7 9 10 8 2 1
10             0.05044      5 3 4 6 7 9 10 8 2 1
8              0.05111      5 3 4 6 7 9 10 8 2 1
2              0.06622      5 3 4 6 7 9 10 8 2 1
1              0.08244      5 3 4 6 7 9 10 8 2 1
6 Conclusions

We have proposed a model selection strategy based on multiple comparison
procedures. The ANOVA test can be applied to compare the population means and to
determine the existence of significant differences among the competing models.
However, the proper application of the ANOVA procedure requires certain assumptions
to be satisfied. The probability of making a type I error increases with the number of
comparisons made. Statistical methods to deal with this phenomenon are called
multiple comparison procedures, since they can compare three or more means while
controlling the probability of making at least one type I error.
When these strategies are adequately applied to the error rates of a well-designed
experiment, the required assumptions are verified, and it is possible to determine the
optimal complexity of a given model or, more generally, to determine which of a family
of models best fits a given problem. This approach has been shown to be useful in
determining the optimal degree in polynomial fitting and the optimal number of hidden
units in feedforward networks. Future work will address more specific comparison
procedures and their application to other neural models, such as radial basis function
networks.
Abstract. The gray-scale morphological Hit-or-Miss transform is theoretically
invariant to vertical translation of the input function, which is analogous to a
gray-value shift of the input images. Optimal structuring elements for
the Hit-or-Miss transform operator are designed by neural network learning
using a shared-weight neural network (SWNN) architecture. The early
stage of the neural network system performs feature extraction using the
operator, while the late stage performs classification. In experimental studies, this
morphological feature-based neural network (MFNN) system is applied to
the location of human faces and the automatic recognition of vehicle license plates to
examine the property of the operator. The results of the experimental studies
show that the gray-scale morphological Hit-or-Miss transform operator
reduces the effects of lighting variation.



1 Introduction 

Due to lighting and illumination variability, the same object or pattern can be
recognized differently [1]. Even a slight change in the lighting environment or in shading
across the object region can easily influence the classification results. This fact makes
recognition of objects or patterns difficult. While this issue has been a great concern
in computer vision, most efforts have been devoted to dealing with some property or
features of the image, i.e., edge or distribution, and the reflectance property of the object
surface [2]. Unfortunately, those approaches are not feasible for fully automated image
recognition and analysis.

Recognition methods using a low-dimensional representation of image objects have
recently been introduced [3], [4]. They are often termed appearance-based methods,
as distinct from feature-based methods. These methods have demonstrated ease of
implementation and accuracy. However, they perform recognition reliably only if
the object or pattern has previously been seen under similar circumstances. This
drawback limits their application to problems involving lighting variability.

In this paper, we propose an “operation-based” approach to resolve the problem of
lighting variation. We first introduce an operator, the gray-scale morphological
Hit-or-Miss transform, which is theoretically insensitive to lighting variation in the image
acquisition environment. This operator is applied to the nodes in the feature extraction
stage of a shared-weight neural network (SWNN) [5]. This morphological feature-based
neural network (MFNN) system is then applied to the location of human faces and to the
recognition of vehicle license plates, where the primary goal is to recognize the four
consecutive digits in the license plate. Images were collected in various
lighting conditions and at different locations. The experimental results demonstrate
that the gray-scale morphological Hit-or-Miss transform operator reduces the
effects caused by illumination changes in the image acquisition environment.



2 Morphological Operations for Gray-Scale Images 

In this section, we first describe some preliminary definitions and notations which are 
counterparts for those of binary morphology [6]. Then, we introduce two essential 
gray-scale morphological operations, erosion and dilation, and Hit-or-Miss transform 
that plays a role of feature extraction for the morphological filter neural network 
system. 



2.1 Definitions and Notations 

There are three translations, defined as follows:

• translation : horizontal shift to the right by z

$f_z(x) = f(x - z)$ (1)

• offset : vertical translation by the amount y

$(f + y)(x) = f(x) + y$ (2)

• morphological translation : translation and offset

$(f_z + y)(x) = f(x - z) + y$ (3)

The function g is beneath the function f, denoted by $g \ll f$, if $D[g] \subseteq D[f]$ and
$g(x) \le f(x)$ for $x \in D[g]$, where D[g] denotes the domain of the function g.

Counterparts to intersection and union in binary morphology are minimum and
maximum, respectively. By allowing the negative infinity value for f and g if x is not
in the domain, the minimum is defined by

$(f \wedge g)(x) = \min\{f(x), g(x)\}$. (4)

In a similar way, the maximum is defined by

$(f \vee g)(x) = \begin{cases} f(x) & \text{if } x \in D[f] \text{ and } x \notin D[g] \\ \max\{f(x), g(x)\} & \text{if } x \in D[f] \cap D[g] \\ g(x) & \text{if } x \notin D[f] \text{ and } x \in D[g] \\ \text{undefined} & \text{if } x \notin D[f] \cup D[g] \end{cases}$ (5)

The gray-scale operation analogous to rotation of a set about its origin is reflection. The
reflection of a function h is defined by

$h^*(x) = -h(-x)$. (6)



2.2 Erosion and Dilation 

Using the notion of “fitting”, two essential gray-scale morphological operations, erosion
and dilation, are defined [6], [7]. The gray-scale erosion of a function f by another
function g (called the structuring element) is defined by

$(f \ominus g)(x) = \max\{y : g_x + y \ll f\}$. (7)

Instead of finding the maximum “offset”, we can find the “minimum difference”
between the values f(z) and the translated structuring element values $g_x(z)$ for all
$z \in D[g_x]$, which results in

$(f \ominus g)(x) = \min\{f(z) - g_x(z) : z \in D[g_x]\}$. (8)

Note that this measures how well the shape of the structuring element g fits under the
function f, and it is only defined at points where $g_x \ll f$.

The gray-scale dilation of a function f by a structuring element g can be defined in a
dual manner to the gray-scale erosion. This notion leads to the following definitions:

$(f \oplus g)(x) = \min\{y : (g^*)_x + y \gg f\}$ (9)

and

$(f \oplus g)(x) = \max\{f(z) - (g^*)_x(z) : z \in D[(g^*)_x]\}$. (10)

The gray-scale dilation indirectly measures how well the shape of the structuring
element g fits above the function f.

Several algebraic properties of the gray-scale erosion and dilation are described well
in [6], [7]. From Eq. (7) to Eq. (10), note that the outputs of erosion and dilation are
not invariant to offset (vertical translation) of the function f.
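
As a rough illustration of Eqs. (8) and (10), the following Python sketch implements one-dimensional erosion and dilation with NumPy. It rests on assumptions that are not part of the original text: the structuring element g is stored as an odd-length array whose centre element corresponds to the origin, outputs are computed only where the whole window lies inside the signal, and the function names are hypothetical.

```python
import numpy as np


def erode(f, g):
    """Gray-scale erosion, Eq. (8): minimum of f(z) - g_x(z) over the window."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    r = len(g) // 2  # half-width of the (odd-length) structuring element
    return np.array([np.min(f[x - r:x + r + 1] - g)
                     for x in range(r, len(f) - r)])


def dilate(f, g):
    """Gray-scale dilation, Eq. (10): maximum of f(z) - (g*)_x(z) = f(z) + g(x - z)."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    r = len(g) // 2
    # Reversing g realises the reflection g(x - z) inside the window.
    return np.array([np.max(f[x - r:x + r + 1] + g[::-1])
                     for x in range(r, len(f) - r)])


f = np.array([3.0, 5.0, 4.0, 8.0, 6.0, 2.0, 7.0])
g = np.array([0.0, 1.0, 0.0])
offset = 10.0
# Adding an offset to f shifts both outputs by the same amount,
# which is the lack of offset invariance noted in the text.
print(erode(f + offset, g) - erode(f, g))
print(dilate(f + offset, g) - dilate(f, g))
```

Both printed differences equal the offset at every position, which is what motivates combining the two operations in the next subsection.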



2.3 Hit-or-Miss Transform 

In binary morphology, the Hit-or-Miss transform probes the inside and the outside of
the image A (i.e., a set) with a pair of disjoint structuring elements B = (E, F).
The mathematical formulation of this transform is

$A \otimes B = (A \ominus E) \cap (A^c \ominus F)$. (11)

Using the notion of the “umbra transform”, which is a way to consider a gray-scale image
as a binary image (i.e., a set) in 3-dimensional Euclidean space [6], a novel gray-scale
Hit-or-Miss transform is defined by

$[f \otimes (h, m)](x) = (f \ominus h)(x) - (f \oplus m)(x) = \min\{f(z) - h_x(z)\} - \max\{f(z) - (m^*)_x(z)\}$ (12)

where $z \in D[h_x]$ and $D[h_x] = D[(m^*)_x]$. The complete procedure to achieve Eq. (12) is
provided in [8].

The gray-scale Hit-or-Miss transform has a useful property: for any constant offset $\lambda$,

$f \otimes (h, m) = (f + \lambda) \otimes (h, m)$. Thus, the gray-scale Hit-or-Miss transform can reduce effects
caused by illumination and sensor parameter changes for image acquisition systems.
A proof of this property is provided in the Appendix.
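
The offset invariance can be checked numerically with a small one-dimensional sketch. The code below treats the transform as erosion by h minus dilation by m, as in Eq. (12); the array layout (odd-length structuring elements centred on the origin), the test signal, and the function name are assumptions made only for this illustration.

```python
import numpy as np


def hit_or_miss(f, h, m):
    """1-D gray-scale Hit-or-Miss: erosion of f by h minus dilation of f by m."""
    f = np.asarray(f, dtype=float)
    h = np.asarray(h, dtype=float)
    m = np.asarray(m, dtype=float)
    r = len(h) // 2  # h and m are assumed to share the same odd length
    out = []
    for x in range(r, len(f) - r):
        window = f[x - r:x + r + 1]
        hit = np.min(window - h)         # erosion term
        miss = np.max(window + m[::-1])  # dilation term (reflected element)
        out.append(hit - miss)
    return np.array(out)


f = np.array([3.0, 5.0, 4.0, 8.0, 6.0, 2.0, 7.0])
h = np.array([0.0, 1.0, 0.0])
m = np.array([1.0, 0.0, 1.0])
shifted = hit_or_miss(f + 25.0, h, m)  # same signal under a uniform gray-value shift
original = hit_or_miss(f, h, m)
print(np.allclose(shifted, original))
```

The offset adds the same constant to the minimum and to the maximum, so it cancels in the difference; this is the argument formalized in the Appendix.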



3 Morphological Filter Neural Network 



3.1 Architecture of Neural Network System 

A shared-weight neural network (SWNN) is composed of two cascaded sub-networks: a
feature extraction network followed by a classification network [5]. The feature
extraction network usually has a two-dimensional array for the input image and
performs convolution operations over its input with the weight kernels, which
generates a feature map. The weight kernel, a two-dimensional local connection, is
identical for the nodes in the same feature map. Each node corresponds to a certain
position in the input. There can be more than one feature map layer and each layer can
have more than one feature map. The classification network is an ordinary
feedforward network whose input is the feature maps in the last feature
extraction layer. A more precise description of the SWNN is given in [8].

The SWNN is trained with inputs that have a fixed size. The output indicates the class
to which the input belongs. For locating an object in an input image of arbitrary size,
one should scan the entire image with the SWNN. An extended shared-weight neural
network (ESWNN) architecture that probes the entire image more efficiently with the
weight kernels obtained by the SWNN is proposed in [8], [9]; this is called scanning mode
operation. The result of scanning mode operation is a detection plane which presents the
possibility of object existence at each point of the input image.

Our work is mainly focused on locating and recognizing image patterns (human faces
and digits) simultaneously in the input image scenes (indoor images and
vehicle images). For this purpose, the SWNN and ESWNN provide a suitable
architecture.



3.2 Feature Extraction Operation 

Morphological processing has been widely used for pattern recognition as a feature
extraction methodology [6], [10]. The outcome of the morphological processing is
highly dependent on the characteristics of the structuring elements [11]. In terms of
mathematical morphology, the weight kernel of the SWNN can be considered a
structuring element.
The nodes in the feature extraction network of the ordinary SWNN perform the usual
linear weighted sum followed by mapping through a sigmoid function. Our MFNN instead
performs the gray-scale Hit-or-Miss transform defined in Eq. (12). To perform this
operation, each node should have two structuring elements: one for erosion and the
other for dilation. The output of the node is defined by



$a(x,y) = net^h_{(x,y)} - net^m_{(x,y)}$ (13)

where

$net^h_{(x,y)} = \min\{a(c,d) - h_{(x,y)}(c,d)\}$ (14)

and

$net^m_{(x,y)} = \max\{a(i,j) - m^*_{(x,y)}(i,j)\}$. (15)

In Eqs. (13), (14) and (15), $(c,d) \in D[h_{(x,y)}]$, $(i,j) \in D[m^*_{(x,y)}]$, and (x,y) indicates the
location of the output node in the feature map, while (c,d) and (i,j) indicate those of the
input nodes. Also, $h_{(x,y)}$ and $m_{(x,y)}$ represent the structuring elements for erosion and
dilation, respectively.



3.3 Learning Rules 



Since the learning rules for the nodes in the classification network are widely available
in the neural network literature [12], we only provide those for the feature extraction
network.

For each node in the feature extraction network, the learning rules for each
structuring element should be provided. Based on the gradient descent method with
application of the chain rule, the learning rules for the feature extraction network are
summarized as [8]:



^K.y)ic,d) = r|5^^y^ 



d\^y^{c,d) 



(16) 



and 



where 






gnet™ 3,) 

(b J) 



k 



(17) 



(18) 



for the nodes in the last feature extraction layer and 



(x,y) 



= US 

(P.1) 



(p.1) 



5 net 






8a(x, y) 8a(x, y) 



(p.i) 



(19) 



for the nodes in the hidden layers. In Eq. (18), k indexes the nodes in the first hidden 
layer of the feedforward classification network. 

Assume for a while that all derivatives exist. Then, they are given by 
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anet? , f-1 if(c,c?)= arg min \a{s,t) - h^^ As,t)] 

(±Z)_=J (20) 

[ 0 Otherwise 

0net" , if (c,d)= aigmin {a(s,t)-m^^ Js,t)] 

= J (.,<).«[».<„,] (21) 

^^{x,y)f-C,d) 0 Otherwise 

Snetf , fl if (x,y)= argmin 

= \ (s,0e£>[*„,„] (22) 

5a (x, y) 0 otherwise 

5netr , if(-^>j)= arg min (a(s,f) -«(„„) (^,0) 

(V).n[m„,„] (23) 

da{x,y) ^ Q otherwise 

Note that erosion and dilation (i.e., net^ and net™) are only piecewise differentiable. 
Mathematically, the derivative is not defined for the case that there are more than one 
index for (c,d) and (i,j) in Eq. (14) and (15). In this case, we simply arbitrarily select 
one index. 



4 Image Data Sets 



4.1 Face Images 

Hundreds of human face images were collected from a variety of sources. We first
arbitrarily selected 1000 images that only include a frontal view of a human face with
minimal background. The collected images were then resized to 132x100. This
data set was used only for training the MFNN.




Fig. 1. Some examples of indoor images {test images) 



4.2 Indoor Images 

We first collected 640x480 gray-scale images from offices in the university under
appropriate lighting conditions. From this data set, three thousand 132x100
sub-images were collected. This sub-image data set was used as background (non-face)
to train the MFNN together with the images described in Section 4.1.



812 Y. Won, J. Nam, and B.-H. Lee 



Twenty images of 640x480 were also independently collected. They include 
human faces of which the size was approximately 132x100. This data set was used 
only to test the MFNN. Some images of the test data set are presented in Fig. 1. 



4.3 Vehicle Images 

The vehicle images were collected from image databases of speed control systems.
The system captured the images of vehicles at a distance of approximately 100 meters.
The size of the images is 480x512. From the database, we selected 1000 images in
reasonably good condition to collect the training images described in the following
Section 4.4. Some example images of the training set are presented in Fig. 2.




Fig. 2. Some example images of the training set 



4.4 License Number Images 



From the training image set described in the section 4.3, we manually collected sub- 
images of digits in the license plate, of which the size is 35x19. From those sub- 
images, we arbitrarily selected 200 image patterns for each digit. This data set was 
only used for training the network along with the non-digit image set described in the 
section 4.5. Some examples of digit image along with the images of the license plate 
are shown in Fig 3. Note that we only considered the four consecutive digits. 







Fig. 3. Some examples of license plates and digits 



4.5 Non-digit Images 

Non-digit sub-images were collected randomly during the training process from
outside the digit region. The digit region was defined by a bounding box of size 23x11
centered at the center of a digit. This method can overcome problems
caused by the wide variation of the background (non-digit).
5 Human Face Detection

In this experimental study, we trained the MFNN with the sub-image sets described in
Sections 4.1 and 4.2. During training of the network, each image was presented by
shifting it three pixels in both the horizontal and vertical directions around the center of
the image. The background region was filled with the image region discarded by the shift.

In this experiment, the feature extraction network had a single layer with two
feature maps. The size of the structuring elements was 5x5. The feature extraction
operation was done at every other point, which dramatically reduces computational
complexity [8], [9]. Therefore, the output of the feature extraction operation had
half the size of the input. The classification network also had a single hidden layer with
four nodes and two output nodes: one represents the human face while the other represents
the background.

To locate the human faces using the ESWNN architecture, we used the detection
plane generated by the output node representing the human face class. The detection
plane was first multiplied by 255 and thresholded at the value 230. We then
repeatedly applied erosion with a 3x3 flat structuring element until the last non-empty
eroded image was obtained. The MFNN located the human faces in 16 images
out of 20 test images, with some false alarms in which the background was classified as a
human face. The undetected ones were captured in very dark environments, such as the last
image in Fig. 1. In comparison, the ordinary SWNN only detected the faces in 12
images and produced many false alarms.
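
The post-processing of the detection plane (scaling, thresholding, and repeated erosion until the last non-empty image) might look like the following sketch. Several details are assumptions made for illustration only: the detection plane is taken to hold values in [0, 1], the threshold is applied as "greater than or equal", pixels outside the image are treated as background, and the function names are hypothetical.

```python
import numpy as np


def erode3x3(binary):
    """Binary erosion with a flat 3x3 structuring element (zero padding at the border)."""
    padded = np.pad(binary, 1, constant_values=False)
    out = np.ones_like(binary, dtype=bool)
    rows, cols = binary.shape
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + rows, dx:dx + cols]
    return out


def locate_peaks(plane, scale=255, threshold=230):
    """Scale and threshold the detection plane, then erode until just before it vanishes."""
    current = (np.asarray(plane) * scale) >= threshold
    if not current.any():
        return current  # nothing detected
    while True:
        eroded = erode3x3(current)
        if not eroded.any():
            return current  # last non-empty eroded image
        current = eroded


plane = np.zeros((9, 9))
plane[2:7, 2:7] = 0.95  # a 5x5 blob of high face-class response
peaks = locate_peaks(plane)
print(np.argwhere(peaks))
```

In this toy example the 5x5 blob shrinks to 3x3 and then to its single centre pixel before a further erosion would empty the image, so the centre pixel is reported as the location.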



6 Vehicle Identification 

We trained the MFNN with the image data sets described in Sections 4.4 and 4.5.
The feature extraction network had a single layer with 5 feature maps. The size of the
structuring elements was 5x5. The classification network also had a single hidden
layer with 25 nodes and 10 output nodes: each output node corresponded to a digit
from 0 to 9.

Training was stopped when the number of epochs reached 200 or the RMSSE was
below 0.05. When a non-digit image was presented, the desired output values for all
output nodes were 0. A learning rate of 0.05 and a momentum of 0.8 were used. The trained
network was then tested with the image data set described in Section 4.3 using the
ESWNN architecture.

We first thresholded the 10 detection planes at 0.87 to generate binary detection
planes, and then performed a pixel-by-pixel logical OR operation by

$D(x,y) = \bigvee_{i=0}^{9} P_i(x,y)$

where $P_i$ indicates a detection plane. Fig. 4 shows test images and their corresponding
binary images D, called the detection image plane. Each white pixel in the D image was
then labeled with the digit to which it corresponds.

We first extracted the coordinates of the 1's from the detection image plane D. We used a
coordinate system that had the origin at the top-left corner, x as the horizontal axis
and y as the vertical axis. Assume that the pixel $(x_i, y_i)$ is labeled p and the pixel
$(x_{i+1}, y_{i+1})$ is labeled q. Then, a simple rule “q is the next digit to p if
$(x_i + 15 < x_{i+1} < x_i + 22) \cap (y_i - 4 < y_{i+1} < y_i + 4)$” can determine four
consecutive numbers. This rule was generated based
on the facts observed from the detection planes.
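
A minimal sketch of how the “next digit” rule could be chained to read four consecutive digits is given below. The representation of detections as (x, y, digit) tuples, the depth-first search, and the function names are assumptions made for this illustration; the rejection of five-digit chains mentioned later in the text is not modelled.

```python
def is_next(p, q):
    """Rule from the text: q follows p if it lies 15 to 22 pixels to the right
    and within 4 pixels vertically (strict inequalities)."""
    (xp, yp), (xq, yq) = p, q
    return xp + 15 < xq < xp + 22 and yp - 4 < yq < yp + 4


def find_plate_number(detections, length=4):
    """Search for a chain of `length` labeled pixels linked by the next-digit rule.

    `detections` is a list of (x, y, digit) tuples taken from the labeled plane D.
    """
    def extend(chain):
        if len(chain) == length:
            return chain
        last = chain[-1]
        for candidate in detections:
            if is_next(last[:2], candidate[:2]):
                found = extend(chain + [candidate])
                if found:
                    return found
        return None

    for start in sorted(detections):  # try the left-most detections first
        chain = extend([start])
        if chain:
            return "".join(str(d[2]) for d in chain)
    return None


# Four plausible digit detections plus one isolated false alarm.
detections = [(100, 50, 3), (118, 51, 7), (137, 49, 1), (155, 50, 9), (30, 200, 1)]
print(find_plate_number(detections))
```

The isolated detection cannot start a chain, so the search falls through to the four detections that satisfy the rule pairwise.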




Fig. 4. Examples of test results 



A white pixel may have multiple labels. In this case, we simply
selected the label corresponding to the detection plane that had the largest value for that
pixel. A test on 500 arbitrarily selected images resulted in 441 correct identifications
(88.2%). Only 6 vehicles (1.2%) were misidentified, and 53 vehicles (10.6%) were rejected.
Most of the rejections were caused by the border of the plate frame, which is easily
confused with the digit 1 and thus generated five consecutive numbers.



7 Conclusion 

Lighting and illumination changes make recognition of objects difficult. We proposed
a gray-scale morphological Hit-or-Miss transform which has a useful property that
can reduce effects due to lighting condition and sensor parameter changes for image
acquisition systems. The gray-scale morphological Hit-or-Miss transform is applied to the
node operation of the feature extraction network in the shared-weight network.

We examined the property of the operator by performing two experimental studies:
human face detection and vehicle identification. The results of our experiments
demonstrate that the MFNN is immune to reasonable lighting changes.
Appendix: Proof for Property

Let z be a point in D[f], the domain of f. Assume that $D[h] = D[m] = D[h^*] = D[m^*]$.
Let $U = D[h_z]$. Then,

$[(f + \lambda) \otimes (h, m)](z) = \min_{x \in U}\{f(x) + \lambda - h_z(x)\} - \max_{x \in U}\{f(x) + \lambda - (m^*)_z(x)\}$

$= \min_{x \in U}\{f(x) - h_z(x)\} + \lambda - \max_{x \in U}\{f(x) - (m^*)_z(x)\} - \lambda$

$= \min_{x \in U}\{f(x) - h_z(x)\} - \max_{x \in U}\{f(x) - (m^*)_z(x)\}$

$= [f \otimes (h, m)](z)$.

Q.E.D.
1 Introduction

In pattern recognition, the choice of features to be detected is a critical factor in
determining the success or failure of a method; much research has gone into finding the
best features for particular tasks [1]. When images are detected by digital cameras,
they are usually acquired as rectangular arrays of pixels, so the initial features are
pixel values. Some methods use those pixel values directly for processing, for
instance in normal matched filtering [2], whereas other methods execute some degree
of pre-processing, such as binarizing the pixel values [3].

An important tool for pattern recognition is the correlation matrix between objects
and its zero-mean cousin, the covariance matrix. Because the curse of dimensionality
plagues so many pattern recognition procedures, a variety of methods for
dimensionality reduction have been proposed. One of the classical statistical
procedures is principal component analysis [1]. This method (known in the
communication theory literature as the Karhunen-Loeve expansion) finds a lower-dimensional
representation that accounts for the variance of the features. The
diagonalization of the correlation or covariance matrix is significant for image
processing because, among other advantages, it implies the decomposition of images
into independent components, it minimizes entropy, it minimizes the mean squared
error when some terms are removed, and it is related to principal value decomposition
and to factor analysis. Unfortunately, the diagonalization of large matrices
corresponding to the covariance matrices of images with many pixels is often beyond
the capacity of even today's powerful computers.

It is therefore of interest to find an image decomposition that easily diagonalizes a
correlation matrix. In this paper, we introduce such a decomposition,
which we then use to see some familiar pattern recognition techniques in a
new light and to propose a new and powerful approach to pattern recognition. The
correlation matrix that we propose should not be confused with the classical
covariance matrix between objects. Our correlation matrix is between the nonlinearly
transformed features of two objects.
2 Image Representation in Terms of the SONG Decomposition

Any two-dimensional quantized gray-scale image f(x, y) can be decomposed into a
sum of orthogonal elementary images $\{e_q(f)\}$ having the orthogonal property

$e_m(f(x,y)) \, e_n(f(x,y)) = 0$ if $m \ne n$
$e_m(f(x,y)) \, e_n(f(x,y)) = 1$ if $m = n$ (1)

Each sub-image $e_q(f(x,y))$ represents a gray level slice of the object. We define
the Sliced Orthogonal Nonlinear Generalized (SONG) decomposition of f(x, y) as

$f(x,y) = \sum_{q=1}^{Q-1} q \, e_q(f(x,y))$ (2)

where Q is the total number of gray levels in the image and the basis is defined as

$e_q(f(x,y)) = \begin{cases} 1 & f(x,y) = q \\ 0 & \text{otherwise} \end{cases}$ (3)

A more general definition will be introduced in Section 3. Note that each object point
is characterized by only one gray level, so each q-slice is orthogonal to all of the
others, as indicated by the orthogonal property. By considering a three-dimensional
space with coordinates (x,y,q), the method can be interpreted as one of placing planes
parallel to the (x,y) coordinate plane of the image; each plane then “slices” the
function in the area of intersection. All of these areas form disjoint sets of pixels.

The function $e_q(f(x,y))$ is an eigenfunction of the image pixel matrix f(x,y) with
eigenvalue q, because $f(x,y) \, e_q(f) = q \, e_q(f(x,y))$. This property will be important
when we consider the correlation matrix.
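
The decomposition of Eqs. (2) and (3) is easy to check numerically. The short NumPy sketch below is only an illustration: the toy image, the value Q = 4, and the helper name `song_slices` are assumptions introduced here.

```python
import numpy as np


def song_slices(f, Q):
    """Binary gray-level slices e_q(f) for q = 1..Q-1, as in Eq. (3)."""
    return {q: (f == q).astype(int) for q in range(1, Q)}


f = np.array([[0, 1, 2],
              [3, 2, 1],
              [1, 0, 3]])
Q = 4  # gray levels 0..3
slices = song_slices(f, Q)

# Eq. (2): the weighted sum of the slices rebuilds the image.
reconstructed = sum(q * e for q, e in slices.items())

# Different slices never share a pixel, so their pointwise product vanishes.
overlap = sum(int(np.sum(slices[m] * slices[n]))
              for m in slices for n in slices if m != n)

print(np.array_equal(reconstructed, f), overlap)
```

The same slices also satisfy the eigenfunction relation stated above, since multiplying the image by a slice keeps only the pixels whose value is q.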



3 Pattern Recognition in Terms of the SONG Decomposition: The 
SONG Correlation Matrix 

We shall define a SONG correlation [4] based on the new binary SONG 
decomposition. We shall also relate this correlation to a correlation matrix. Each 
coefficient of the matrix can be viewed as the cross-correlation between two binary 
slices of the input scene and of the reference object binarized using the SONG 
decomposition. The general definition of the SONG correlation matrix allows the 
representation of common correlation operations as special cases of the correlation 
matrix. 



3.1 Definition of the SONG Correlation 

A more general definition of the SONG correlation between an input function g(x)
and an object prototype f(x) is

$\Theta_{gf}(x) = \sum_{q=1}^{Q-1} \sum_{k=1}^{Q-1} W(q,\chi) \, W'(k,\chi) \, [e_q(g(x)) \otimes e_k(f(x))]$ (4)

where $W(q,\chi)$ and $W'(k,\chi)$ are weighted correlation factors, $\chi$ is a parameter
that may depend on detection functions or other metrics, and $\otimes$ denotes the linear
correlation operation. For the sake of clarity, we use one-dimensional functions
although we will apply this correlation to two-dimensional images. We introduce a
general modified representation of any image as

$f(x) = \sum_{q=1}^{Q-1} W(q,\chi) \, e_q(f(x))$ (5)

Note that if the weight factors are equal to the gray level values, we reconstruct the
original function f(x). Then the correlation of Eq. (4) can be expressed in matrix terms
as






$\Theta_{gf} = \oplus \, [R_{qk}]_{(Q-1) \times (Q-1)} = \oplus \begin{pmatrix} R_{11} & R_{12} & \cdots & R_{1(Q-1)} \\ \vdots & \vdots & & \vdots \\ R_{(Q-1)1} & R_{(Q-1)2} & \cdots & R_{(Q-1)(Q-1)} \end{pmatrix}$ (6)

where $R_{qk} = W(q,\chi) W'(k,\chi) [e_q(g(x)) \otimes e_k(f(x))] = W(q,\chi) W'(k,\chi) LC^{qk}$. The
term $LC^{qk}$ represents the linear sub-correlation between the q-slice of g(x) and the
k-slice of f(x); the operator $\oplus$ means the summation of all the terms of the matrix. It
can be viewed as the norm of the matrix; however, the norm of a matrix is defined as
$\|\Theta\| = Tr\{\Theta^T \Theta\}$, where Tr{} is the trace of a matrix, so the sum of the squared
coefficients is used instead of the sum of the amplitudes. Our operator $\oplus$ can therefore be
viewed as the norm of the matrix in the absolute value sense.



In the following, we shall consider for simplicity objects located such that their
correlation peaks appear at the origin (0,0). Because the correlation operation is shift
invariant, it is trivial to generalize to the case of targets located at any point (x,y) and
to the presence of multiple targets, since the correlation is also additive.



For the case of autocorrelation, where g(x) = f(x), the matrix of Eq. (6) becomes a
diagonal matrix because of the orthogonality between the slices, and

$\Theta^{auto} = \oplus \begin{pmatrix} R_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & R_{(Q-1)(Q-1)} \end{pmatrix}_{(Q-1) \times (Q-1)}$ (7)

The autocorrelation coefficients of the matrix are $R_{qq} = W^2(q,\chi) LC^{qq}$. Note that we
do not need the squared values of the coefficients because the slices are binary
functions.
In the absence of any a priori information about the input scene, there is no reason to
put different weights on the binary slices. So the SONG correlation definition that
we shall consider is

$\Theta_{gf}(x) = \sum_{q=1}^{Q-1} [e_q(g(x)) \otimes e_q(f(x))] = \sum_{q=1}^{Q-1} R_{qq}$ (8)

where $W(q,\chi) = W'(k,\chi) = 1$. We are considering only the corresponding q-gray
level binary slices for the g(x) and for the f(x) functions. This can be viewed as the
summation of the diagonal terms. A similar interpretation is to set to zero all the
off-diagonal terms of the SONG correlation matrix.

The definition of the SONG correlation of Eq. (8) means that this operation can be
viewed as a function that counts the total number of non-zero pixels (or points) in an
image at the origin. Indeed, the SONG correlation process consists of separately
correlating each binary slice from the image with the binary slice of the prototype
corresponding to the same gray level, and then summing the correlation values.
Because the basis $\{e_q(f)\}$ is orthogonal in the f(x) domain, each correlation value is
proportional to the number of pixels that are common to both slices. For the
autocorrelation, the sum over all the gray levels yields the total number of pixels in the
object. So the SONG autocorrelation at the origin is equivalent to a counting
operation: the height of the autocorrelation peak (in the absence of noise) is equal to
the number of pixels in the object, and in the case of false targets, the height of the
correlation peak is equal to the number of pixels that have the same gray level values
in the target and in the prototype. Note that there could be objects that have the same
number of pixels but that look totally different from the reference object. But the
counting operation that results from our cross-correlation measures the number of
equal-valued pixels that are in the same locations for both objects, which is a good
measure of similarity.

Coming back to what we pointed out in the introduction, we introduce a diagonal
correlation matrix representation for pattern recognition. This can be viewed as a
dimensionality reduction of the data, because we use the sum of only the diagonal
terms. Moreover, this choice of terms is best because the autocorrelation values ($R_{qq}$,
from Eq. (7)) are found only along the diagonal, where the correlations of the
corresponding gray level slices are found. The off-diagonal terms correspond to the
correlations between different gray level slices, and they add a background to the final
result without improving the discrimination.
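
The counting interpretation can be illustrated with a few lines of NumPy. The sketch evaluates the SONG correlation of Eq. (8) only at the origin (no shift), with unit weights; the two toy images and the function name are assumptions made for this example.

```python
import numpy as np


def song_correlation_at_origin(g, f, Q):
    """Sum over gray levels q = 1..Q-1 of the overlap between the q-slices of g and f."""
    return sum(int(np.sum((g == q) & (f == q))) for q in range(1, Q))


Q = 4
f = np.array([[0, 1, 2],
              [3, 2, 1],
              [1, 0, 3]])   # reference object
g = np.array([[0, 1, 2],
              [3, 0, 1],
              [2, 0, 3]])   # scene that differs from f in two pixels

auto = song_correlation_at_origin(f, f, Q)   # number of non-zero pixels of f
cross = song_correlation_at_origin(g, f, Q)  # equal-valued, non-zero pixels at the same place
print(auto, cross)
```

Here the autocorrelation counts the seven non-zero pixels of the reference, while the cross-correlation counts the five positions where both images hold the same non-zero gray level.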



3.2 Common Linear Correlation in Terms of the SONG Matrix 

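The linear correlation, $LC_\Theta$, between two quantized gray level functions g(x) and f(x)
can be defined in terms of the SONG decomposition as

$LC_\Theta(x,y) = \left[ \sum_{q=1}^{Q-1} q \, e_q(g(x,y)) \right] \otimes \left[ \sum_{k=1}^{Q-1} k \, e_k(f(x,y)) \right] = \sum_{q=1}^{Q-1} \sum_{k=1}^{Q-1} qk \, LC^{qk}$ (9)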
The subindex indicates that we express the LC as a particular case of the SONG
correlation ($\Theta$). So linear correlation can be expressed by means of the SONG matrix
using the particular weights $W(q,\chi) W'(k,\chi) = qk$. Note that the dimensions
of the matrix are $(Q-1) \times (Q-1)$. So $(Q-1)^2$ sub-correlations are needed to express
the normal linear correlation (which, when done in the usual manner, are all carried
out in parallel). The same number of sub-correlations is required for the general SONG
correlation, and the difference between the two correlations lies in the weight factors.
However, because we shall use the particular case of the SONG correlation
corresponding to Eq. (8), we need only (Q-1) sub-correlations. The SONG
correlation has a significant advantage in discrimination capability because only slices
corresponding to the same gray levels are compared.
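
The relation between the two correlations can be verified at the origin with a small NumPy sketch. It builds the matrix of sub-correlations between slices, weights it by qk to recover the ordinary linear correlation, and takes the trace to obtain the SONG correlation of Eq. (8). The toy images and the helper name are assumptions for illustration, and only the zero-shift value is computed.

```python
import numpy as np


def sub_correlation_matrix(g, f, Q):
    """Entry (q, k): overlap at the origin between the q-slice of g and the k-slice of f."""
    return np.array([[np.sum((g == q) & (f == k)) for k in range(1, Q)]
                     for q in range(1, Q)])


Q = 4
f = np.array([[0, 1, 2],
              [3, 2, 1],
              [1, 0, 3]])
g = np.array([[0, 1, 2],
              [3, 0, 1],
              [2, 0, 3]])

lc = sub_correlation_matrix(g, f, Q)   # (Q-1) x (Q-1) sub-correlations
levels = np.arange(1, Q)               # gray levels 1..Q-1

linear = int(levels @ lc @ levels)     # weights qk on every entry
song = int(np.trace(lc))               # unit weights on the diagonal only
print(linear, int(np.sum(g * f)), song)
```

The weighted sum over all entries matches the direct pixel-wise product sum of the two images, whereas the SONG value uses only the Q-1 diagonal entries.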



4 Noise Robustness of the SONG Correlation 

One might expect that because the SONG correlation is very selective for object
detection and discrimination, it might have poor noise robustness. In this section we
consider images that are degraded by very strong Gaussian noise. We show that
whereas other common detection methods, such as common matched filtering and Phase
Only Filtering (POF), are not able to detect the correct object, the SONG correlation
succeeds. Moreover, the discrimination capability (DC), measured in terms of one
minus the ratio between the cross-correlation and the autocorrelation, is more
stable than for the other methods over a wide range of Gaussian noise levels.

The input scene is shown in Fig. 1. It consists of two objects, the reference object 
being the one in the lower part of the image. This image has 8 gray levels. 




Fig. 1. Input scene with the reference object, placed in the lower part of the figure, and another 

object to be rejected 

Figure 2 is the input scene of Fig. 1 degraded by additive white Gaussian noise
(σ = 1.9). The visual pattern information is wiped out by the noise, but as long as some
pixels of the image remain unaffected, the SONG correlation will yield a high signal.

Fig. 2. Input scene highly degraded with Gaussian (σ = 1.9) noise



Table 1 shows the discrimination capability (DC) for the SONG correlation and for 
other common detection methods. A high value of DC, means that the value of the 
cross-correlation is low compared to the autocorrelation, which means that a good 
discrimination and good noise robustness are achieved. On the other hand, a low 
value of the ration means that the energy of the cross-correlation has almost the same 
value that of the auto-correlation. 

We generated white Gaussian input noise patterns with various standard deviations σ.
Three correlations were considered: the SONG correlation, the linear correlation and
the linear correlation using a phase-only filter (POF).

Table 1 shows that for highly degraded images, only the SONG correlation is able to 
detect the reference object with a high degree of discrimination. Note that the DC for
the SONG method keeps a high, stable value for all the noise levels. In contrast, none
of the other methods yields high values of the DC, which implies poor performance in
correctly detecting the reference object.



Gaussian noise, mean = 0     SONG           Linear         Phase-only
Stand. dev. (σ)              correlation    correlation    filter

0.2                          0.95           0.05           0.80
0.25                         0.95           0.00           0.70
0.5                          0.95           0.00           0.3
0.75                         0.95           0.00           0.2
1                            0.95           0.00           0.2
1.2                          0.95           0.00           0.1
1.9                          0.95           0.00           0.1



Table 1. The discrimination capability (DC) of several pattern recognition operations and the 
new SONG correlation. 



5 Conclusion 



We have introduced a sliced orthogonal nonlinear generalized (SONG) matrix
representation that allows common linear and nonlinear correlations to be
represented. The SONG correlation can be expressed as a sum of linear
correlations between the binary gray-level slices of an input scene and of a reference
object. A weight factor is included in the definition to make it more
general. The discrimination capability and the robustness to
Gaussian noise of the SONG method are superior to those of the other methods considered.
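The description of the SONG correlation as a weighted sum of correlations between binary gray-level slices can be sketched in one dimension. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the 1-D signals, the restriction to shifts where the reference fits entirely inside the scene, and the default unit weights are all choices made here for brevity.

```python
def song_correlation(scene, ref, levels, weights=None):
    """Sum over gray levels of the correlation between binary slices (1-D).

    scene, ref: sequences of integer gray levels in range(levels).
    weights:    one weight per gray level (assumed to default to 1.0).
    Returns one value per shift of `ref` inside `scene`.
    """
    weights = weights or [1.0] * levels
    n_shifts = len(scene) - len(ref) + 1
    out = [0.0] * n_shifts
    for g in range(levels):
        # Binary slice: 1 where the signal takes gray level g, 0 elsewhere.
        s_slice = [1 if v == g else 0 for v in scene]
        r_slice = [1 if v == g else 0 for v in ref]
        for t in range(n_shifts):
            # Only slices with the same gray level are compared.
            out[t] += weights[g] * sum(
                r * s_slice[t + j] for j, r in enumerate(r_slice)
            )
    return out
```

With unit weights, each output value counts the positions at which scene and reference share exactly the same gray level, which is why the response is highly selective.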
Abstract. The k-nearest-neighbour (k-NN) search algorithm is widely
used in pattern classification tasks. A large set of fast k-NN search
algorithms have been developed in order to obtain lower error rates. Most
of them are extensions of fast NN search algorithms where the condition
of finding exactly the k nearest neighbours is imposed. All these
algorithms calculate a number of distances that increases with k. Also,
a vector-space representation is usually needed in these algorithms.

If the condition of finding exactly the k nearest neighbours is relaxed,
further reductions in the number of distance computations can be obtained.
In this work we propose a modification of the LAESA (Linear
Approximating and Eliminating Search Algorithm, a fast NN search
algorithm for metric spaces) in order to use a certain neighbourhood for
lowering error rates and reducing the number of distance computations at
the same time.

Keywords: Nearest Neighbour, Metric Spaces, Pattern Recognition. 



1 Introduction 

Non-parametric classification is one of the most widely used techniques in pattern
recognition [2]. One of the simplest techniques (and one of the most popular) is
to use the nearest-neighbour (NN) classifier which, given an unknown sample x,
finds the prototype p in the training set which is closest to x, and then classifies x
in the same class as p. The NN classifier usually obtains acceptable error rates,
but it is possible to obtain better (lower) error rates using a number k of nearest
neighbours. Thus, a k-nearest-neighbour (k-NN) classifier finds the k nearest
neighbours of the sample x, and then, through a voting process, it classifies x in
the class which has most representatives among those k nearest neighbours.
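The k-NN rule described above can be sketched as follows. This is a minimal illustration with exhaustive search; the function name and argument layout are assumptions, and ties in the vote are resolved arbitrarily (by the order in which the neighbours are ranked).

```python
from collections import Counter


def knn_classify(sample, prototypes, labels, k, dist):
    """Classify `sample` by majority vote among its k nearest prototypes."""
    # Exhaustive search: one distance computation per training prototype.
    ranked = sorted(
        range(len(prototypes)), key=lambda i: dist(sample, prototypes[i])
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to the NN classifier. Note that `dist` is called once per prototype, which is what becomes impractical when the distance is expensive.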

Usually, these classifiers are implemented through an exhaustive search; that 
is, all the distances between the sample and the prototypes in the training set are 
calculated. When the representation space is a Euclidean space, this exhaustive
search is usually not very time-consuming. On the other hand, when working 

* The authors wish to thank the Spanish CICyT for partial support of this work 
through project TIC97-0941. 
on a metric space^ in which the temporal cost of calculating the distance between
two prototypes is high (as for instance the edit distance when classifying
handwritten characters |3), the exhaustive search becomes impractical. 

Several algorithms (AESA [3, LAESA and TLAESA among others) 
have been developed which find the nearest neighbour in a metric space with a 
low number of distance calculations. Also, the AESA algorithm has been exten- 
ded to find the k nearest neighbours with a low number of distance calcula- 
tions. In this paper, we present an extension of the LAESA algorithm that uses 
at most k neighbours to classify the sample. Although this extension does not 
find exactly the k nearest neighbours, the error rates obtained are very close to 
those of a classifier that uses the exact k nearest neighbours. 

In the next section we will introduce the LAESA algorithm and the proposed
modification, the Approximating k-LAESA (Ak-LAESA) algorithm. The
following sections describe the experiments and the results obtained, which show
that the Ak-LAESA obtains error rates close to those of a k-NN classifier, while
at the same time calculating a reduced number of distances. Finally we present
the conclusions and we outline some future work.



2 The Ak-LAESA Algorithm

The Ak-LAESA algorithm is derived from the LAESA algorithm. This
latter algorithm relies on the triangle inequality to prune the training set. It
uses a lower bound of the real distance between each prototype and the sample,
which is used to eliminate all those prototypes that cannot be closer to the
sample than a given one. The LAESA algorithm has two main steps:

1. Preprocessing step: 

This step is carried out before the classification begins. First, it selects a
number of prototypes^, called base prototypes. Then, it calculates and
stores the actual distances between each base prototype and all the other
prototypes in the training set (including the base prototypes).

2. Classification step: 

During this step, the distances calculated in the preprocessing step are used
to obtain a lower bound of the distance between each prototype and the
sample. Using this lower bound of the distance, the algorithm iteratively
finds a good candidate to nearest neighbour, calculates the actual distance
between this candidate and the sample and prunes the training set^, until
there is only one prototype: the nearest neighbour.
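The lower bound used in the classification step follows from the triangle inequality: for any base prototype b, |d(p, b) - d(x, b)| <= d(x, p). A minimal sketch of this bound is given below; the function name and the list-based interface are assumptions for illustration.

```python
def lower_bound(d_x_to_bases, d_p_to_bases):
    """Lower bound of d(x, p) from the distances of x and p to the bases.

    d_x_to_bases[j]: d(x, b_j), computed during classification.
    d_p_to_bases[j]: d(p, b_j), stored during preprocessing.
    """
    # Each base gives one bound; the largest one is the tightest.
    return max(abs(dp - dx) for dp, dx in zip(d_p_to_bases, d_x_to_bases))
```

A prototype whose bound already exceeds the distance to the current best candidate can be discarded without ever computing its real distance to the sample.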

The Ak-LAESA (see figure 1) is a simple but powerful evolution of the
LAESA algorithm that simply stops the search for the nearest neighbour when

^ A metric space is a representation space which has some kind of metric defined. 

^ This number depends exclusively on the dimensionality of data (see 0 for more
details).

^ All the prototypes whose lower bound of the distance is bigger than the actual 
distance between the candidate and the sample can be safely eliminated. 
the number of remaining (not pruned) prototypes is less than a number k. Then,
it classifies the sample by voting among those prototypes. Our experiments show
that the number of prototypes used in the voting is in general smaller than k, and
that those prototypes are not exactly the k nearest neighbours. Despite this,
error rates for this algorithm are very close (differing by less than 1%) to those of
a classifier using the exact k nearest neighbours. Also, the algorithm calculates
fewer distances than a classifier using the k nearest neighbours, and
also fewer distances than the LAESA algorithm itself.
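The search described in this section can be sketched in Python as follows. This is an illustrative transcription of the pseudocode in figure 1 under stated assumptions, not the authors' code: the names `ak_laesa`, `base_idx` and `alive` are introduced here, the base prototypes are assumed to be already selected, k is assumed to be at least 1, and voting ties are resolved arbitrarily.

```python
import math
from collections import Counter


def ak_laesa(x, protos, labels, base_idx, D, k, dist):
    """Approximate k-NN classification of x.

    protos   : list of training prototypes
    labels   : class label of each prototype
    base_idx : indices of the base prototypes (assumed already chosen)
    D        : D[p][j] = dist(protos[p], protos[base_idx[j]]), precomputed
    Returns (assigned class, number of distances computed).
    """
    col = {b: j for j, b in enumerate(base_idx)}  # base index -> column of D
    alive = set(range(len(protos)))               # not yet pruned (set P)
    G = [0.0] * len(protos)                       # lower bounds of d(x, p)
    best, d_best = None, math.inf
    s, nc = base_idx[0], 0                        # start from a base prototype
    while alive:
        dxs = dist(x, protos[s])                  # the only real distance here
        nc += 1
        alive.discard(s)
        if dxs < d_best:
            d_best, best = dxs, s
        b, gb, q, gq = None, math.inf, None, math.inf
        for p in list(alive):
            if s in col:                          # tighten the bound using base s
                G[p] = max(G[p], abs(D[p][col[s]] - dxs))
            if p in col:                          # next candidate among the bases
                if G[p] < gb:
                    gb, b = G[p], p
            elif G[p] >= d_best:                  # cannot beat the best: prune
                alive.discard(p)
            elif G[p] < gq:                       # next candidate outside the bases
                gq, q = G[p], p
        if b is not None:
            s = b
        elif len(alive) < k:                      # fewer than k survivors: stop
            break
        else:
            s = q
    voters = alive | {best}                       # put the best candidate back
    winner = Counter(labels[i] for i in voters).most_common(1)[0][0]
    return winner, nc
```

The returned counter `nc` makes it easy to compare the number of distance computations with the exhaustive search, which always needs one per prototype.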

3 Experiments 

In 0 we presented some results of the Ak-LAESA algorithm on synthetic data,
using 2 classes. In this work, the algorithm for generating clustered data appeared 
in [3 has been used to generate synthetic data from 4 classes and with two values 
for the dimensionality of the generated data: 6 and 10. This algorithm generates 
random synthetic data from different classes (clusters) with a given maximum 
overlap between them. Each class follows a Gaussian distribution with the same 
variance and different randomly chosen means. The overlap was set to 0.04 and 
the variance to 0.05, in order to obtain low error rates (less than 8% for the NN 
classifier). 

Once we had the synthetic data prepared, several partitions of training and 
test set with different sizes were made. For training sets, the sizes used were: 
1024, 2048, 3072, 4096, 6144 and 8192. The test set size was 512 for all training 
set sizes. All experiments were repeated 16 times with different training and test 
sets of each size, in order to obtain statistically significant results.

In the first experiment we compare Ak-LAESA error rates with those of the
NN classifier and various k-NN classifiers. The experiment has been repeated for
k = 7 and k = 17, using data of different dimensionalities (see figure 2). As
expected, Ak-LAESA has lower error rates than the NN classifier (and also lower
than some k-NN classifiers), but they are slightly higher than those of the k-NN
classifier using the same value of k. Another point is that the Ak-LAESA does
not always use k prototypes to classify the sample: it uses at most k. This feature
could explain the difference between Ak-LAESA error rates and k-NN classifier
error rates for a fixed value of k. The average number of neighbours used by
Ak-LAESA was calculated in order to perform a more accurate comparison
(see figure 6). For k = 7, the averages were 5.27 prototypes with dimensionality 6
and 6.78 with dimensionality 10. For k = 17, they were 9.22 and
15.42, respectively. Figure 2 shows a comparison of the Ak-LAESA with a k-NN
classifier that used approximately the same number of prototypes (6, 7, 10 and
16 respectively), and also with a k-NN classifier with k = 7 and k = 17. The
error rates of the NN and 3-NN classifiers have also been plotted.

Even though Ak-LAESA error rates are slightly higher than those of the
k-NN classifiers (but lower than those of a NN classifier), our experiments show
that the number of distance calculations of the Ak-LAESA does not depend on
the size of the training set (see figure 3), while the number of distance calculations
Input:     P ⊂ E, n = |P|;    { finite set of training prototypes }
           B ⊂ P, m = |B|;    { set of base prototypes }
           D ∈ R^(n×m);       { precomputed n × m array of interprototype distances }
           x ∈ E;             { test sample }
           k ∈ N;             { maximum number of neighbours to use }
Output:    c ∈ N;             { class assigned to the sample x }
Functions: d : E × E → R;     { distance function }
           VOTING : P → N;    { voting function }
Variables: p, q, s, b ∈ P;
           G ∈ R^n;           { lower bounds array }
           p* ∈ E;            { best candidate to nearest neighbour }
           d*, dxs, gp, gq, gb ∈ R;
           nc ∈ N;            { number of computed distances }
           stop : Boolean;    { used to stop the search }

begin
  d* := ∞; p* := indeterminate; G := [0];
  s := arbitrary_element(B); nc := 0; stop := false;
  while |P| > 0 and not stop do
    dxs := d(x, s); nc := nc + 1;            { distance computing }
    P := P − {s};
    if dxs < d* then
      d* := dxs; p* := s                     { updating d*, p* }
    endif
    b := indeterminate; gb := ∞; q := indeterminate; gq := ∞;
    for every p ∈ P do                       { eliminating and approximating loop }
      if s ∈ B then
        G[p] := max(G[p], |D[p, s] − dxs|);  { updating G, if possible }
      endif
      gp := G[p];
      if p ∈ B then                          { approximating: selecting from B }
        if gp < gb then gb := gp; b := p endif
      else { p ∉ B }
        if gp >= d* then
          P := P − {p}                       { eliminating from P − B }
        else                                 { approximating: selecting from P − B }
          if gp < gq then gq := gp; q := p endif
        endif
      endif
    endfor
    if b ≠ indeterminate then s := b
    elsif |P| < k then stop := true          { stop the search }
    else s := q endif
  endwhile
  P := P ∪ {p*};                             { retrieve the best candidate to NN }
  c := VOTING(P);
end



Fig. 1. Algorithm Ak-LAESA




of a k-NN classifier (when using the exhaustive search) is exactly the size of the
training set. When the temporal cost of calculating the distance between two
prototypes is high, the Ak-LAESA is faster than the k-NN classifier, and obtains
error rates very close to those of the k-NN classifier.

The second experiment was performed to show that the Ak-LAESA calculates
fewer distances than the LAESA algorithm. Also, as happens with the
LAESA algorithm, the number of distance calculations does not depend on the
training set size. Figure 3 shows the average number of distance calculations of
the LAESA algorithm, and the Ak-LAESA with k = 3 and k = 7.



[Figure 3 plot, panel: Dimensionality 6]

Fig. 3. Average number of distance calculations of the LAESA and Ak-LAESA
algorithms, as the training set size increases.



The aim of the third experiment was to show the behaviour of Ak-LAESA
when the value of k increases, and to compare it with the behaviour of a
k-NN classifier. As shown in figure 4, the error rate starts (with
dimensionality 6) at a value close to 7% in both classifiers. As the value of k
increases, the rate tends to 4% for the k-NN classifier and to 5% for
Ak-LAESA. Figure 5 plots the average difference between Ak-LAESA error rates
and the k-NN classifier error rates, showing that the k-NN classifier error rates
are lower, but the difference is (with dimensionality 6 and 10) less than 1%.

Finally, an experiment was carried out to find out how many of the prototypes
used by the Ak-LAESA algorithm to classify were actually among the k nearest
neighbours (see figure 6). The tables in figure 6 show, for two different values of
k:

1. The average number of prototypes used by the Ak-LAESA in the voting
process.

2. How many of these prototypes are among the k nearest neighbours (and
also the percentage), that is, how many of the final prototypes used by the
Ak-LAESA are among the k nearest neighbours.

3. How many are among the 2k nearest neighbours.

4. How many are among the 4k nearest neighbours.
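The counts listed above can be obtained with a small helper such as the one sketched below. The function name and interface are assumptions for illustration; the true nearest neighbours are found here by exhaustive search.

```python
def count_among_nearest(used, x, protos, m, dist):
    """How many prototype indices in `used` are among the m nearest to x."""
    ranked = sorted(range(len(protos)), key=lambda i: dist(x, protos[i]))
    nearest = set(ranked[:m])
    return sum(1 for i in used if i in nearest)
```

Calling it with m = k, 2k and 4k for each test sample, and averaging over the test set, gives the three "among" quantities of the list.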



[Figure 4 plot, panel: K-NN classifier, dimensionality 6; axis: error rate (%)]

Fig. 4. Error rates of the Ak-LAESA classifier compared to the k-NN classifier, as the
value of k increases.



The results show that the number of prototypes used by the Ak-LAESA which
are among the k nearest neighbours decreases as dimensionality increases, while
at the same time Ak-LAESA error rates improve (see figure 5).

4 Conclusions and Future Work 

We have developed a fast classifier based on the LAESA algorithm El which 
obtains error rates close to those of a k-NN classifier, while calculating a low
number of distances. Also, the temporal and spatial complexities of the
Ak-LAESA algorithm are the same as those of the LAESA algorithm. The
Ak-LAESA error rates (and their behaviour as k increases) are very close to those of a
classifier that uses the k nearest neighbours. The Ak-LAESA performance does
not seem to decrease as the size of the training set grows. Also, the behaviour of the
Ak-LAESA seems to improve as the dimensionality of the data increases.

As for the future, we will explore the behaviour of the Ak-LAESA with data
of dimensionalities higher than those used for these experiments. Also, we plan 
to apply this algorithm to real data tasks. 

Currently, the base prototypes selection algorithm of the Ak-LAESA has
been borrowed from the LAESA algorithm, and we have also used the LAESA’s 
optimal number of base prototypes for each value of the dimensionality. We think 



[Figure 5 plot: AK-LAESA vs K-NN; axis: average difference between error rates (%)]

Fig. 5. Average difference between error rates of the Ak-LAESA classifier and a k-NN
classifier, as the value of k increases.



Dimensionality 6

Value of k   Prototypes used   Among k NN    Among 2k NN   Among 4k NN
7            5.27              2.33 (44%)    3.04 (58%)    3.74 (71%)
17           9.22              5.14 (56%)    6.49 (70%)    7.61 (83%)

Dimensionality 10

Value of k   Prototypes used   Among k NN    Among 2k NN   Among 4k NN
7            6.78              1.42 (21%)    1.86 (27%)    2.56 (38%)
17           15.42             3.61 (23%)    5.57 (36%)    8.15 (53%)

Fig. 6. Prototypes used by the Ak-LAESA which are among the k nearest neighbours.
that a different number of base prototypes or a different selection algorithm for
base prototypes could improve the error rates of the Ak-LAESA, especially with
low-dimensionality data.

Acknowledgements: The authors wish to thank Jorge Calera-Rubio and Mikel 
L. Forcada, for their invaluable help in writing this paper. 
Abstract. This paper presents a real application that adds learning capabilities
to a telerobotic system originally designed to manipulate everyday objects on a
board. User interaction is based on a restricted natural language, and the object
recognition techniques have been designed specifically to obtain a
system that responds to user commands as fast as possible. The article first
gives an overall description of the system and its main functions. It then
compares the different recognition procedures that have already been added to
the implementation, and identifies the algorithm that best fits the real needs of
the project, balancing efficiency against recognition capability.



1 Introduction 

In recent years telerobotic systems have become more and more important,
particularly because the number of inaccessible and risky work-sites has
increased. Telerobotic systems are now evolving beyond the function of being pure
extensions of their operators, and nowadays this is one of the major concerns for re- 
searchers in the field. According to [6], a good telerobotic system should be both 
transparent and semi-autonomous. The system should be designed to augment the 
operator’s capabilities by semi-automating tasks whenever it is possible and cost- 
effective. 

In this paper we present a real application whose goal is to design and implement a
telerobotic system that includes learning capabilities and aims at human-robot
communication based on a subset of natural language. The paper also
analyses several recognition procedures in order to determine which one
offers the highest hit rate and the best performance, allowing the
interface to be used in real time. The system, which provides control over a
generic vision-guided robot, follows an object-oriented distributed architecture
based on the CORBA standard [12]. This allows the system to interconnect multiple
modules running on different platforms and implemented in different programming
languages (PROLOG, LISP, C++, Java, etc.). In addition, the user interface
is written in Java, allowing it to run on multiple platforms or, even better,
over the Internet.



2 Related Work 

Many different telerobotic systems have been reported since Goertz introduced the 
first teleoperator at the Argonne National Laboratory four decades ago [4]. As ex- 
pected, most of the former systems required expensive hardware and software in order 
to implement a user interface that operates far away from the robot. Nowadays, the 
expansion of the World Wide Web has allowed an increasing number of user interfaces
that control remote devices not only for robots, but also cameras, coffee pots, and 
cake machines, to name some. The first telerobotic systems with this kind of interface 
were presented by researchers from the University of Southern California (USC) and 
the University of Western Australia (UWA). The Mercury Project [5], carried out at 
the USC, led to the development of a system in which the manipulator was a robotic 
arm equipped with a camera and a compressed air jet, and the interface consisted of
a web page that could be accessed using any standard browser. The telerobotic system
developed at the UWA [15] lets the user control an industrial robot to manipulate 
objects distributed on a table. The user interface allows an operator to specify the co- 
ordinates of the desired position of the arm, the opening of its gripper and other multi- 
ple parameters. 

In our system, we aim to allow operators to use their voice to control the robot
movements. Besides this, the system is able to learn from experience and from user
interaction. Thus, it will accept high-level commands such as “pick up the small pen”,
which require some knowledge to be carried out. On the other hand, as we
will see in the following sections, the recognition algorithm has been selected so
that it can be used in a real-time application: the robot response must arrive
within a few seconds. In fact, controlling a robot via the web poses
many additional difficulties, because the web is a very uncertain environment where
the speed of data transmission cannot be guaranteed and delay is always present. A
very recent discussion of these topics in the telerobotics domain can be found in [3].
However, we have selected the web as the means of communication between the
user interface and the robot because it allows the manipulator to be accessed from any
computer connected to the Internet. It makes this connection cheap and easy, and we
consider these reasons sufficient to prefer the Internet to a dedicated network that
would offer much better performance. Besides, as new broadband Internet
connections become available (e.g. ADSL or satellite), this web bottleneck will fade too.



3 Overall System Description 

This section summarises the complete project we are involved in, which consists
of a telerobotic system based on computer vision and a static robot, programmed in
real time through a remote interface using a restricted natural language input.

A camera is used to obtain images of the environment with the objects the robot can
reach. These images are transferred over the Internet to the computer running the
user interface, and then shown to the user so that the real state of the
environment is known. The restricted natural language input enters the
system via the user interface, which translates these commands in real time into
another language that the robot can understand.

First of all, we should mention the previous work explained in [13],
where all the problems related to grasping and camera operations are already
solved. In order to make good use of the features already implemented there, we
have designed a server application that offers low-level grasping and camera services.
We will refer to this subsystem as the “Robot Server”. The Robot Server is in charge of
setting up the robot and the camera, and of controlling them. This subsystem runs on a
PC and interacts with a SCARA robot via the parallel port. The Robot Server
capabilities are used remotely by the user interface running on the Internet through a
distributed object-oriented interface implemented with CORBA (Common Object
Request Broker Architecture) [12].

Secondly, we will introduce the user interface capabilities. Its inputs are an image
of the objects in the scene, a database with descriptions of the objects that the
system can recognise, and a command specified by the user in the restricted natural
language. The output is a low-level set of commands in a protocol that the Robot
Server can understand. We will refer to this protocol as GCCP (Grasping and Camera
Control Protocol). This GCCP set of commands is evaluated first by the Robot
Server and finally translated to a different set of low-level commands depending on
the particular robot we are interacting with. The whole process is summarised in fig. 1.




Fig. 1. Overall System Architecture. This figure is a brief description of the different entities
belonging to the system. The Robot Server knows about GCCP, the particular robot it interacts
with and the grasping characterisation. The user interface recognises a few voice commands and
translates them to GCCP by making use of the objects database. Knowledge retrieved from user
interaction is stored in the database (learning capability).
4 User Interface Description

Basically, the user interface consists of a Java application that allows users to
obtain images of the robot’s workspace and configuration information about the robot
and the camera, send commands to the Robot Server for controlling the robot, and
access the object database in the Database Server to read and update the robot’s
knowledge.

The user interface, which is shown in fig. 2, allows the use of the restricted natural
language for specifying the commands to be sent to the Robot Server. This means
users can employ the same names and adjectives they use to refer to objects in the real
world, as well as the same constructions for expressing location (above, to the left
of, etc.). The natural language module translates user commands into GCCP
interactions that the robot can understand. This module has access to the database of
known objects in order to check whether the objects that users are referring to can be
handled by the system. When an object cannot be recognised by the module
described in the previous section, the user will be asked for the object name.




Fig. 2. User interface (remote controller) 
5 Object Recognition Analysis

As the system is intended to be used in real time, this section analyses
several recognition algorithms that may allow this kind of interaction. The object
recognition and learning tasks are performed by a user interface module, which
analyses the real images received from the Robot Server or provided by
the user, and then computes a set of descriptors that uniquely identify the objects
contained in the different scenes. These descriptors, known as “HU descriptors”, are
based on the invariant-moments theory [7]. Once we have computed the “HU descriptors”
for a given object sample, we apply the learning recognition procedure to determine
whether this object belongs to an already learned class or identifies a category that must
be added to the robot’s knowledge set.
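As an illustration of moment-based descriptors, the sketch below computes the first two of Hu's classical invariants from normalised central moments. It is a minimal example under stated assumptions: the function name is introduced here, the image is a plain list of rows containing at least one nonzero pixel, and the paper does not state how many invariants its descriptor uses or how they are scaled.

```python
def hu_first_two(image):
    """Return the first two Hu invariants of a 2-D list of pixel intensities.

    Assumes non-negative intensities and at least one nonzero pixel.
    """
    pts = [(x, y, v)
           for y, row in enumerate(image)
           for x, v in enumerate(row) if v]
    m00 = sum(v for _, _, v in pts)
    cx = sum(x * v for x, _, v in pts) / m00   # centroid, x coordinate
    cy = sum(y * v for _, y, v in pts) / m00   # centroid, y coordinate

    def eta(p, q):
        # Central moment mu_pq, normalised to remove the effect of scale.
        mu = sum((x - cx) ** p * (y - cy) ** q * v for x, y, v in pts)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    return phi1, phi2
```

Because the moments are taken about the centroid, the values do not change when the object is translated, and both combinations are unchanged by rotation, which is what makes such descriptors suitable for objects lying in arbitrary poses on the board.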

Before the HU descriptors of an object are extracted and classified, the
image is preprocessed and segmented in order to identify the objects belonging to the
scene. This is accomplished by first binarising the scene and
then applying a simple segmentation algorithm that searches from top to bottom and
from left to right for the different objects to be treated. The idea is to keep it simple
in order to obtain as much performance as possible.
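A preprocessing stage of this kind can be sketched as a threshold followed by a raster-order labelling of connected regions. The code below is an illustration only: the fixed threshold, the assumption that objects are brighter than the background, the 4-connectivity and the flood-fill strategy are choices made here, not details given in the paper.

```python
def binarise(image, threshold):
    """Return a binary image: 1 where the pixel exceeds the threshold."""
    return [[1 if v > threshold else 0 for v in row] for row in image]


def segment(binary):
    """Label 4-connected objects, scanning top to bottom and left to right.

    Returns (labels, count), where labels[y][x] is 0 for background or the
    1-based number of the object the pixel belongs to.
    """
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not labels[y][x]:
                count += 1                      # first pixel of a new object
                labels[y][x] = count
                stack = [(y, x)]
                while stack:                    # flood fill the whole object
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count
```

Objects are numbered in the order in which the scan first meets them, so the numbering follows the top-to-bottom, left-to-right order mentioned above.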

In order to select the algorithm that best fits our needs, we have implemented
several distances and learning recognition algorithms, and obtained timing and
recognition quality results that are explained in the next section. From these
we can select the set of algorithms best suited to our current needs.



Distances 



The distances selected for the comparative analysis are the “Euclidean
distance”, the “Normalised distance”, the “Extended Euclidean distance” [10] and the
“Per-class Extended Euclidean distance”. The last three follow the same philosophy,
which is to apply a weighted Euclidean distance to the object samples (see formulae (1),
(2) and (3)). The only difference between them is the definition of the weights. Here
“n” is the number of elements that define a HU descriptor, “wi” is the weight applied to
an element of a HU descriptor, and “i” is the integer, varying from 1 to “n”, that
identifies a component of the HU descriptor array.



d(X, Y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}    (1)



Formula (1). Generic distance between vectors X and Y. Depending on the distance
algorithm applied, the weights will vary accordingly.
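A weighted Euclidean distance of this generic form can be sketched as follows. The sketch assumes that each weight multiplies the squared difference of its component under the square root; the exact placement of the weights in the original formula could not be recovered from the text, and the function name is introduced here.

```python
import math


def weighted_distance(x, y, w):
    """Weighted Euclidean distance between descriptor vectors x and y."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    # Assumption: weights multiply the squared component differences.
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))
```

With all weights equal to 1 this is the plain Euclidean distance; the other distances discussed in this section differ only in how the weights are chosen.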
Formula (2). Average of the samples belonging to a class for the “i” descriptor.



Formula (3). Deviation of the samples belonging to a class for the “i” descriptor.
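
The per-class average and deviation described by these two captions can be written in a standard form, given here only as a sketch. The symbols N_k (number of samples of class k) and x^{(k)}_{ji} (component i of sample j of class k) are introduced for illustration, and whether the original normalises by N_k or by N_k - 1 cannot be determined from the text.

```latex
m_{ki} = \frac{1}{N_k} \sum_{j=1}^{N_k} x^{(k)}_{ji}
\qquad
s_{ki} = \sqrt{\frac{1}{N_k} \sum_{j=1}^{N_k} \left( x^{(k)}_{ji} - m_{ki} \right)^{2}}
```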



As formula (4) shows, the “Normalised distance” uses as weights the
inverse of the variance of every class. This means that the per-class variance must be
maintained and updated at all times in order to apply this distance. The key point is
that the weights are different for each class, so its cost is higher.



w_{ki} = \frac{1}{s_{ki}^{2}}    (4)



Formula (4). Normalised distance weight definition. The more stable a HU
descriptor column is, the more significant it will be for the recognition criterion. The
weights are different for each of the classes. Here “k” identifies the specific class and
“i” the column of the HU descriptors array.



On the other hand, we designed the “Extended Euclidean distance” in order to
speed up the “Normalised distance” (formula (5)). It is based on a previous statistical
study that defines the constant weights that will be used by this distance. The
important thing is that the weights to be applied are known in advance, so we do not have
to maintain any per-class variance in order to implement the recognition procedure.
However, as explained in [10], in order to perform this previous statistical study
we must know in advance the set of classes that will be entered into the system.
Basically, it is designed for closed environments where the classes are well known in
advance.




Formula (5). Extended Euclidean distance definition. This takes into account the
negative effect that the largest HU components have on the recognition process.
Thus, it extends the Normalised approach in order to favour the smaller HU
components. In addition, since the objective is to have a constant weight for the whole
recognition algorithm, the formula above is applied. Here “N” is the number
of classes.

Finally, in order to bring the “Extended Euclidean distance” capabilities to
an open system where the classes are not known in advance, we have designed
the “Per-class Extended Euclidean distance” (see formula (6)). Basically it is a
mixture of the “Extended Euclidean distance” and the “Normalised” one. First, it
defines different weights for each class in the knowledge base (as the Normalised
distance does), and second, it takes into account the scaling properties of each HU
descriptor, as the “Extended Euclidean distance” does.

w_{ki} = \frac{1}{s_{ki}} \cdot \frac{1}{m_{ki}}    (6)



Formula (6). Per-class Extended Euclidean distance definition. The weights to be
applied are different for each one of the classes. It takes into account the component
scale in order to perform the recognition.



Learning Algorithms 

In order to compare different recognition algorithms we have selected two of them, the
“Minimum Distance Classifier” and the “K-Nearest Neighbors Classifier”, and we have
extended their definitions to give them learning capabilities. Basically, when a sample is
not recognised as belonging to any class of the current knowledge base, we ask the user for
a new class name and add it to the class database. From then on the system is able to
recognise samples belonging to that class, which means the program has learned a new class
definition. On the other hand, when a sample is recognised as belonging to an already known
class, the class definition is refined too. For the “Minimum Distance Classifier” this is
accomplished by keeping in the class definition the average of its whole sample set: if a
new sample belongs to the class, the average is updated accordingly. For the “K-Nearest
Neighbors Classifier”, the new sample is added to the knowledge base as it is.



6 Results 

The main objective of the present study is to select the combination of distance and
recognition algorithm that completes the process as fast as possible with an acceptable
quality. To do so, a test battery has been designed using 120 images belonging to 6
different kinds of objects. A sample of each class is shown in fig. 3.

Fig. 3. Classes of objects for the battery test.



As we can see in fig. 4, we have represented the 8 combinations of distances (Euclidean,
Extended, Normalised and Per-class) and classifiers (Minimum Distance and K-nearest
neighbors). For each of these combinations we run the procedure to classify the 120
samples. The result is the time in seconds spent classifying the 120 samples and the
percentage of samples recognised properly.

Besides this, we consider the best algorithm to be the one that maximises the percentage of
samples recognised properly divided by the time spent to do so. This ratio is shown
graphically in fig. 5.




Fig. 4. Success × Time results. For every algorithm it shows the % success and the time
spent recognising the 120 samples.

As we can see in fig. 4, the algorithm to apply seems to be the “Extended Euclidean
Distance” with the “Minimum Distance Classifier”. It is very fast and offers a success rate
above 95 %. However, as introduced in the section above, the “Extended Euclidean Distance”
needs a previous statistical analysis in order to calculate the constant weights that work
best. This kind of distance has been designed for systems where the classes managed are
always the same. If a new class has to be added to the knowledge base, the statistical
process must be carried out again.

For open systems, where the number of classes to be managed is not limited, we should
select the next best algorithm: the one composed of the “Normalised Distance” and the
“Minimum Distance Classifier”.

[Figure 5 legend: Euclidean + Extended Minimum Classifier; Extended Euclidean Distance +
Extended Minimum Classifier; Normalised Distance + Extended Minimum Classifier; Per-class
Extended Euclidean Distance + Extended Minimum Classifier; Euclidean Distance + Extended KN
Classifier; Extended Euclidean Distance + Extended KN Classifier; Normalised Distance +
Extended KN Classifier; Per-class Extended Euclidean Distance + Extended KN Classifier.
Axis label: Algorithms.]


Fig. 5. Productivity results. For every algorithm it shows the % Success/Time ratio. The
best algorithm is the one that properly recognises the most samples in the least time.



7 Conclusions and Future Work 

The conclusions of the present study are twofold: the “Extended Euclidean Distance” should
be used for closed systems (always managing the same classes) and the “Normalised Euclidean
Distance” for open systems. Both are to be used with the “Extended Minimum Distance
Classifier”, which adds learning capabilities to the traditional “Minimum Distance
Classifier”. The “K-nearest neighbors Classifier” should not be used because it is
computationally too expensive and would make the real-time operation required by our
telerobotic system impossible.

Future steps in the project will be oriented towards completing and improving the set of
services that each component of the system must provide. In the user interface, the two
main points to improve, in order to allow a more natural interaction with the user, are
the support for voice command input and the richness of the language the system can
understand. We also plan to implement task learning capabilities, so that the user can
define tasks as a sequence of other already-known ones. Besides this, the next version of
the user interface will allow the user to manipulate the objects in the scene graphically,
simulating the results before sending the command to the Remote Server. This means the user
will be able to control the robot using both natural language and mouse interaction.

As a final conclusion, we believe that in the years to come more intelligence will be
added to both the robot and the interface sides of the system, making user-robot
interaction flow in a more natural way.

Abstract. Clustering in Metric Spaces can be conveniently performed by the so-called
k-medians method. It consists of a variant of the popular k-means algorithm in which
cluster medians (most centered cluster points) are used instead of the conventional
cluster means. Two main aspects of the k-medians algorithm deserve special attention:
computing efficiency and initialization. Efficiency issues have been studied in previous
works. Here we focus on initialization. Four techniques are studied: Random selection,
Supervised selection, the Greedy-Interchange algorithm and the Maxmin algorithm. The
capabilities of these techniques are assessed through experiments in two typical
applications of Clustering; namely, Exploratory Data Analysis and Unsupervised Prototype
Selection. Results clearly show the importance of a good initialization of the k-medians
algorithm in all the cases. Random initialization too often leads to bad final partitions,
while best results are generally obtained using Supervised selection. The
Greedy-Interchange and the Maxmin algorithms generally lead to partitions of high quality,
without the manual effort of Supervised selection. Of these two algorithms, the latter is
generally preferred because of its better computational behaviour.

Key words: Clustering, Metric Spaces, K-Medians algorithm, K-Medians initialization,
Greedy-Interchange algorithm, Maxmin algorithm



1 Introduction 

One of the most popular clustering techniques is the so-called k-means, c-means or basic
ISODATA algorithm. It aims to partition the data into k clusters so that the sum of
squared Euclidean distances between samples and their corresponding cluster means is
minimized. Given k initial estimates of cluster means, it alternates two basic steps under
an iterative scheme. These steps are the classification of samples in accordance with
their nearest cluster means, and the computation of new cluster means. Each new partition
decreases the sum of squared distances between samples and their corresponding cluster
means. Although the k-means algorithm is suboptimal, it generally achieves good
approximate solutions at the expense of a moderate computational cost.

It is easy to modify the k-means algorithm for the case where data cannot be adequately
represented in a suitable vector space, though a metric is available to measure the
dissimilarity between data points. To do this, each cluster mean is approximated by a
“most centered sample” or median; that is, a sample whose sum of distances to all cluster
samples is minimum. We call this strictly distance-based clustering technique k-medians or
k-centroids.
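The k-medians iteration just described can be sketched in a few lines. This is a minimal illustration, not the fast versions mentioned later in the paper: the function names, the `max_iter` safeguard and the treatment of empty clusters (the old median is kept) are assumptions made for the example. Only a distance function is required; sample coordinates are never used.

```python
def set_median(cluster, dist):
    """Return the sample of `cluster` whose sum of distances to the cluster is minimum."""
    return min(cluster, key=lambda c: sum(dist(c, x) for x in cluster))


def k_medians(samples, seeds, dist, max_iter=100):
    """Alternate nearest-median classification and median recomputation."""
    medians = list(seeds)
    for _ in range(max_iter):
        # Step 1: assign every sample to its nearest median.
        clusters = [[] for _ in medians]
        for x in samples:
            nearest = min(range(len(medians)), key=lambda j: dist(x, medians[j]))
            clusters[nearest].append(x)
        # Step 2: recompute each median (an empty cluster keeps its old median).
        new_medians = [set_median(c, dist) if c else m
                       for c, m in zip(clusters, medians)]
        if new_medians == medians:  # no change: the partition is stable
            break
        medians = new_medians
    return medians
```

Because the seeds fully determine the outcome, the quality of the final partition depends on how `seeds` is chosen, which is the question the initialization techniques address.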

Practically speaking, there are two major aspects of the k-medians algorithm which deserve
special attention: its computing cost and its initialization. Computational aspects have
been carefully studied in previous works. In one of them we proposed a fast version of the
k-medians algorithm which basically consists of introducing a fast Nearest-Neighbour
search technique for efficiently computing the closest median of a sample. In another, we
proposed a fast median search technique which can be used to reduce the complexity
associated with the computation of cluster medians.

Initialization issues are studied in this paper. We compare four initialization techniques
for the k-medians algorithm: random selection, supervised selection, the greedy-interchange
algorithm and the maxmin algorithm. Random selection simply consists of picking k cluster
seeds at random. This is an efficient standard initialization technique, but it often
selects seeds that are close together and thus leads to low quality partitions. Supervised
selection assumes that a small subset of samples can be labeled in accordance with a
tentative classification scheme. If such an assumption is reasonable, seeds can be selected
class by class to ensure better dispersion than in the case of random selection. This is
also the purpose of the greedy-interchange and maxmin algorithms, though manual effort is
replaced by computing cost in this case. The former consists of two time-consuming
heuristics for the k-medians clustering problem which are applied consecutively. The
latter is a slightly modified version of an efficient initialization technique for the
k-means algorithm.

The greedy-interchange and maxmin algorithms are described in sections 2 and 3
respectively. Experiments are aimed at assessing the capabilities of the different
initialization techniques in two typical applications of clustering; namely, exploratory
data analysis and unsupervised prototype selection. Results with synthetic data (Gaussian
mixtures) as well as real data (human banded chromosomes) are reported in section 4.
Conclusions are summarized in section 5.



2 The Greedy-interchange Algorithm 

K-medians clustering can be properly stated as a combinatorial optimization problem. Given
a metric space $(E, d)$, a finite set of $n$ data points or “prototypes” $P \subset E$ and
a positive integer $k$, we seek a subset of $k$ cluster “representatives”, $Q \subset P$,
for which the following criterion is minimized:

$$z(Q) = \sum_{p \in P} \min_{q \in Q} d(p, q) \qquad (1)$$

Note that there are different ways to assign the $n - k$ non-representative prototypes to
the same $k$ representatives but, among them, we are only interested in the partition
which results from assigning each prototype to its nearest representative. Criterion (1)
measures the sum of distances associated with such a minimum distance partition.
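As a small illustration of criterion (1) and of the minimum distance partition it refers to, the following sketch evaluates both directly from their definitions. The function names are chosen for the example, and representatives are assumed to be hashable so that they can serve as dictionary keys.

```python
def criterion(prototypes, representatives, dist):
    """Criterion (1): sum over all prototypes of the distance to the nearest representative."""
    return sum(min(dist(p, q) for q in representatives) for p in prototypes)


def minimum_distance_partition(prototypes, representatives, dist):
    """Assign each prototype to its nearest representative."""
    clusters = {q: [] for q in representatives}
    for p in prototypes:
        nearest = min(representatives, key=lambda q: dist(p, q))
        clusters[nearest].append(p)
    return clusters
```

A direct evaluation like this costs one distance computation per (prototype, representative) pair, which is the cost that the auxiliary arrays of the following sections are meant to reduce.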

The k-medians clustering problem, well known as a prototype location problem, is NP-Hard.
Despite this negative result, there are quite a few heuristics other than the k-medians
algorithm that provide good approximate solutions. Most of these heuristics, however,
cannot be used for clustering large data sets because of their high complexity. Two of the
fastest heuristics are the greedy and interchange algorithms described below.



2.1 The Greedy Algorithm 

As its name indicates, this algorithm follows a greedy strategy to draw $k$ representatives
from the prototypes. In iteration $t$, the set of $t - 1$ previously selected
representatives is enlarged with a prototype that leads to a maximum decrease of (1). That
is, the set of representatives selected in iteration $t$, $0 \le t \le k$, is:

$$Q^t = \begin{cases} \emptyset & \text{if } t = 0, \\ Q^{t-1} \cup \{q_t\} & \text{if } t > 0 \end{cases}
\qquad \text{where } q_t = \operatorname*{arg\,min}_{p \in P - Q^{t-1}} z(Q^{t-1} \cup \{p\})$$

This algorithm computes the criterion function approximately $(n - k)k$ times, and hence a
direct implementation makes $(n - k)^2 k^2$ distance computations. Fortunately, this
computing cost can be notably reduced by simply introducing an auxiliary array of
distances between prototypes and their nearest representatives (see fig. 1). Its
complexity is then approximately $(n - k)^2 k$ distance computations.
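The greedy selection with its auxiliary array can be sketched in Python as follows. This is an illustrative transcription under a few assumptions: prototypes are referred to by index, ties are broken in favour of the first candidate found, and the names `greedy` and `nearest` are chosen for the example.

```python
import math


def greedy(prototypes, dist, k):
    """Select k representatives greedily; returns their indices into `prototypes`."""
    n = len(prototypes)
    # Auxiliary array: distance from each prototype to its nearest representative.
    nearest = [math.inf] * n
    chosen = []
    for _ in range(k):
        best_z, best = math.inf, None
        for c in range(n):
            if c in chosen:
                continue
            # Value of criterion (1) if candidate c were added to the representatives.
            z = sum(min(nearest[p], dist(prototypes[p], prototypes[c]))
                    for p in range(n) if p not in chosen)
            if z < best_z:
                best_z, best = z, c
        chosen.append(best)
        # Refresh the auxiliary array with the newly selected representative.
        for p in range(n):
            nearest[p] = min(nearest[p], dist(prototypes[p], prototypes[best]))
    return chosen
```

Thanks to the auxiliary array, each candidate evaluation needs only one distance per non-representative prototype instead of one per (prototype, representative) pair.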



Algorithm greedy(P, d, k; Q ⊂ P)

Variable: D ∈ R^n   /* auxiliary array: D_p = min_{q∈Q} d(p, q) */

Method: Q = ∅; D = (∞)^n

  for t = 1 to k do   /* compute q = q_t and add it to Q */
    minz = ∞
    ∀p ∈ P − Q do
      z' = Σ_{p'∈P−Q} min(D_p', d(p', p))   /* z' = z(Q^{t−1} ∪ {p}) */
      if z' < minz then minz = z'; q = p
    Q = Q ∪ {q}; ∀p ∈ P − Q do D_p = min(D_p, d(p, q))



Fig. 1. Greedy algorithm.

2.2 The Interchange Algorithm

The interchange algorithm starts from a given set of representatives and tries to improve
it iteratively. In each iteration, the algorithm searches for the pair (representative,
prototype) whose interchange leads to the minimum value of (1):

$$(q^-, q^+) = \operatorname*{arg\,min}_{(q, p) \in Q \times (P - Q)} z(Q - \{q\} \cup \{p\})$$

If (1) decreases by interchanging $q^-$ and $q^+$, then such interchange is carried out and
the algorithm begins a new iteration; otherwise, the algorithm stops.

There are $(n - k)k$ possible interchanges in each iteration, and each interchange requires
an evaluation of (1). Therefore, a straightforward implementation of this algorithm
computes $(n - k)^2 k^2$ distances per iteration. As in the case of the greedy algorithm,
it is possible to reduce this complexity by introducing a simple algorithmic refinement
(see fig. 2).



Algorithm interchange(P, d, k, Q⁰ ⊂ P; Q ⊂ P)

Variable: D ∈ R^n

Method: Q = Q⁰
  repeat
    interchange = false; minz = ∞
    ∀q ∈ Q do   /* explore possible interchanges with q */
      ∀p ∈ P − Q ∪ {q} do D_p = min_{q'∈Q−{q}} d(p, q')
      ∀p ∈ P − Q do
        z' = Σ_{p'∈P−Q−{p}∪{q}} min(D_p', d(p', p))
        if z' < minz then minz = z'; q⁻ = q; q⁺ = p
    if z(Q − {q⁻} ∪ {q⁺}) < z(Q) then   /* advantageous interchange */
      interchange = true; Q = Q − {q⁻} ∪ {q⁺}
  until ¬interchange



Fig. 2. Interchange algorithm.



For each representative $q$, the refined method efficiently explores all possible
interchanges with $q$ by introducing an auxiliary array, $D$. For each non-representative
prototype $p$, the minimum distance between $p$ and all representatives except $q$ is
computed and stored in $D$. In this way, evaluation of the $n - k$ possible interchanges
with $q$ can be performed by computing $(n - k)^2$ distances. In total, the number of
distance computations per iteration coincides with that of the greedy algorithm (i.e.
$(n - k)^2 k$).
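The refined interchange search can be sketched as follows. The sketch assumes that prototypes are referred to by index and that the first best swap found is the one applied; the function and variable names are illustrative. For simplicity the auxiliary array is filled for every prototype, representatives included, which does not change the value of the criterion.

```python
import math


def interchange(prototypes, dist, initial):
    """Improve a set of representative indices by single swaps until none helps."""
    n = len(prototypes)
    reps = list(initial)

    def z(indices):
        # Criterion (1) evaluated directly for a set of representative indices.
        return sum(min(dist(prototypes[p], prototypes[q]) for q in indices)
                   for p in range(n))

    current = z(reps)
    while True:
        best_z, best_swap = current, None
        for i in range(len(reps)):
            others = reps[:i] + reps[i + 1:]
            # Auxiliary array: distance to the nearest representative other than reps[i].
            aux = [min((dist(prototypes[p], prototypes[o]) for o in others),
                       default=math.inf) for p in range(n)]
            for c in range(n):
                if c in reps:
                    continue
                # Criterion value after replacing reps[i] by candidate c.
                cand = sum(min(aux[p], dist(prototypes[p], prototypes[c]))
                           for p in range(n))
                if cand < best_z:
                    best_z, best_swap = cand, (i, c)
        if best_swap is None:  # no advantageous interchange: stop
            return reps
        reps[best_swap[0]] = best_swap[1]
        current = best_z
```

Since every accepted swap strictly lowers the criterion and there are finitely many subsets, the loop always terminates, although only at a local optimum.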

2.3 Combination: The Greedy-Interchange Algorithm 

We call greedy-interchange combination (algorithm) the consecutive application of both
algorithms. Despite the computational improvements discussed above, the complexity of this
combination is too high for clustering a large data set. Nevertheless, it is still
possible to use this combination with a (random) subset of moderate size so as to obtain
an initial solution for the k-medians algorithm. In this case, the greedy-interchange
algorithm can be efficiently applied by precomputing the matrix of pairwise distances
between prototypes.



3 The Maxmin Algorithm 



The maxmin algorithm is a slightly modified version of an efficient initialization
technique for the k-means algorithm. It can also be seen as a fast approximate greedy
algorithm. Like the greedy algorithm, the maxmin algorithm iteratively selects one
representative at a time. In iteration $t$, the set of $t - 1$ previously selected
representatives is enlarged with the prototype whose distance to its closest
representative is maximum. That is, the set of representatives selected in iteration $t$,
$1 \le t \le k$, is:

$$Q^t = \begin{cases} \{\mathrm{arbitrary}(P)\} & \text{if } t = 1, \\ Q^{t-1} \cup \{q_t\} & \text{if } t > 1 \end{cases}
\qquad \text{where } q_t = \operatorname*{arg\,max}_{p \in P - Q^{t-1}} \min_{q \in Q^{t-1}} d(p, q)$$

A detailed description of this algorithm is shown in fig. 3. This description includes an
auxiliary array which is used just as the auxiliary array included in the greedy algorithm
(fig. 1). The number of distances computed by the maxmin algorithm is approximately
$(n - k)k$. Note that this number is very small in comparison with that of the greedy
algorithm.



Algorithm maxmin(P, d, k; Q ⊂ P)

Variable: D ∈ R^n   /* auxiliary array: D_p = min_{q∈Q} d(p, q) */

Method: Q = ∅; D = (∞)^n; q = arbitrary(P)
  for t = 1 to k do
    Q = Q ∪ {q}; maxmin = 0
    ∀p ∈ P − Q do
      dpq = d(p, q)
      if dpq < D_p then D_p = dpq
      if D_p > maxmin then q' = p; maxmin = D_p
    q = q'   /* the farthest prototype becomes the next representative */

Fig. 3. Maxmin algorithm.
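The same procedure can be sketched in Python. The sketch works on prototype indices and takes the starting prototype as a parameter (`first`, a name chosen for the example) so that runs are reproducible; in the experiments described later this starting prototype is chosen at random.

```python
import math


def maxmin(prototypes, dist, k, first=0):
    """Select k representative indices, each as far as possible from those already chosen."""
    n = len(prototypes)
    # Auxiliary array: distance from each prototype to its nearest representative.
    nearest = [math.inf] * n
    chosen = []
    q = first
    for _ in range(k):
        chosen.append(q)
        farthest, far_dist = None, -1.0
        for p in range(n):
            if p in chosen:
                continue
            # Only distances to the newest representative need to be computed.
            nearest[p] = min(nearest[p], dist(prototypes[p], prototypes[q]))
            if nearest[p] > far_dist:
                farthest, far_dist = p, nearest[p]
        q = farthest  # the farthest prototype becomes the next representative
    return chosen
```

Each iteration computes one distance per remaining prototype, which is where the approximate total of $(n - k)k$ distance computations comes from.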



4 Experiments 

As in the case of the k-means algorithm, the k-medians algorithm can be used both for data
exploration and to provide prototypes for use in supervised classifiers. Assume that
prototypes are grouped into k* compact, well-separated clusters of similar a priori
probabilities, and that each cluster can be appropriately modeled by a single
representative. Then, application of the k-medians algorithm with k = k* should discover
the natural groups and provide an optimal set of representatives. Although this is a
rather unrealistic assumption in many cases, much effort can be saved in early stages of
classifier design if such an assumption can be assessed. If not, application of the
k-medians algorithm is still advantageous when a large set of unlabeled prototypes is
available but only a small fraction of the prototypes is to be used for (re)training a
classifier. In this case, the unconditional distribution of the whole set can be
adequately approximated with the set of representatives selected by the k-medians
algorithm.

The experiments reported hereafter have been designed to study the performance of the
k-medians algorithm as: a) an exploratory data analysis technique; and b) a procedure for
unsupervised selection of prototypes. As we will see, this performance depends very much
on the initialization technique.

4.1 Exploratory Data Analysis 

To study the k-medians algorithm as an exploratory data analysis technique, a simple
classification problem has been chosen. It involves 10 equally-probable classes of
8-dimensional normal densities, with well-separated means and common covariance matrix
$\Sigma = 0.0307\, I$. Two sets of 10,000 independent prototypes each were drawn from this
mixture for training and test purposes.

The Bayes classifier for this problem is linear and can be implemented as a minimum
Euclidean distance classifier by using the true class means as prototypes. Its empirical
error rate is 2.1%. This error rate is also achieved when true class means are replaced by
their empirical estimates, but it reaches 2.8% when means are approximated by class
medians. If these class medians are used to initialize the k-medians algorithm, then a set
of representatives is obtained whose associated minimum distance classification error rate
is 3.0%. On the other hand, the nearest-neighbour rule based on the whole training set
misclassifies 4.0% of the test prototypes.

Taking into account these figures, the application of the k-medians algorithm should be
considered “successful” if, with k = 10, it provides a “quasi-optimal” set of
representatives; that is, one whose associated minimum distance classification error rate
is 3.0%. As will be seen, success is closely related to good initialization: if the
algorithm is not appropriately initialized, then it will fail to pick one representative
of each class and hence the error rate is expected to increase dramatically. This is
certainly true if values of k smaller than 10 are tried. On the other hand, this
undesirable behaviour is expected to be reduced by using larger values of k.

The k-medians algorithm was executed 50 times for each one of the four initialization
techniques previously discussed and each $k \in \{5, 6, \ldots, 30\}$. Although we always
used the same set of 10,000 training prototypes in these executions, a different initial
set of representatives was obtained each time: supervised selection was based on randomly
selected class seeds; the greedy-interchange algorithm was tested on randomly chosen
subsets of 100 prototypes; and the maxmin algorithm was started from a prototype also
chosen at random. We never took advantage of the prototypes' coordinates; only Euclidean
distances were used.

A. Juan and E. Vidal

Results are shown in figure 4. This figure encompasses eight panels distributed in four
rows and two columns. Each row is associated with a different initialization technique,
while the left and right columns correspond to initial and k-medians-optimized sets of
representatives, respectively. Each plotted point in these panels represents the error
rate (E) of the nearest neighbour classifier based on a different set of representatives.
An average error rate curve is included in each panel to help with interpretation of
results.

As expected, the performance of the k-medians algorithm depends very much on the
initialization technique used. The better the initial solutions are, the more
quasi-optimal solutions are found. This tendency is quite clear when comparing random
selection with the other initialization techniques. Random selection almost always fails
to provide good initial sets of representatives and, in consequence, it often leads to
clearly suboptimal solutions (E > 10%). On the contrary, the initial solutions provided by
the other techniques hardly ever lead to such suboptimal results. A finer analysis
reveals, however, that minor differences in quality do exist among these techniques.
Although the best results correspond to supervised selection, they are very similar to
those obtained by the greedy-interchange combination. This excellent outcome confirms that
spending manual effort does not pay off when techniques are available that can do the same
job at the expense of computing cost. In fact, even better results were observed for the
greedy-interchange algorithm in further experiments (not reported here for brevity)
testing this algorithm on random subsets of more than 100 prototypes. Unfortunately, its
high complexity renders this technique useless for clustering large sets. In such a case,
the maxmin algorithm is perhaps the best choice. Despite the fact that this technique
generates slightly worse results than those provided by the greedy-interchange
combination, its complexity matches well with that of the k-medians algorithm.

On the other hand, suboptimal solutions can eventually be circumvented by testing several
sets of representatives or using values of k larger than the number of natural groups. As
expected, it is observed that the chances of being trapped in a suboptimal solution
rapidly decrease as larger values of k are tried. However, this alternative should be used
with caution, since misleading results can be derived when a natural group is “dissected”
which would otherwise be appropriately modeled by a single prototype. Moreover, an
undesirable side effect has also been found in the experiment: the average error rate
increases with k and quickly approaches that of the nearest-neighbour classifier based on
the whole training set (4.0%).

Apart from using the error rate associated with a set of representatives as a measure of
its quality, we also used the average Euclidean distance between the prototypes and their
closest representatives; that is, a normalized version of the k-medians clustering
criterion (1). As in the case of the error rate, this parameter was estimated from the
10,000 test prototypes instead of the training prototypes to assure better statistical
independence. Results, omitted here for the sake of brevity, show basically the same
tendencies as those of fig. 4.

Fig. 4. Results obtained with the k-medians algorithm tested as an exploratory data
analysis technique. Each row of panels is associated with a different initialization
technique; left and right columns of panels correspond to initial and optimized sets of
representatives, respectively. Each plotted point in these panels represents the error
rate of the nearest neighbour classifier based on a different set of representatives.

4.2 Unsupervised Selection of Prototypes

Two classification problems have been considered so as to test the k-medians algorithm as
a procedure for unsupervised selection of prototypes. As in the case of the simple problem
considered in the previous section, the first classification task involves 10 equally
probable Gaussian classes. Now, however, we have chosen 10-dimensional normal densities
with different means and also different covariance matrices. A total of 20,000 independent
prototypes were extracted from this mixture; half for training and half for testing. Some
error rates computed from these prototypes are: 9.9% for the Bayes classifier implemented
as a Gaussian classifier from the true parameters of the densities; 22.3% for the minimum
Euclidean distance classifier based on empirical class means; 31.5% for the same
classifier based on empirical class medians; and 21.7% for the nearest neighbour
classifier designed from the whole set of 10,000 training prototypes. Clearly, this task
is much more difficult than the previous simple problem. The error rate of a nearest
neighbour classifier based on the prototypes selected by the k-medians algorithm will
heavily depend on the value of k used. The larger the value of k, the smaller the expected
error rate. For instance, if a small value such as k = 10 is used, then an error rate not
smaller than 31.5% should be expected. On the other hand, it is clear that rates close to
21.7% will be obtained for values of k approaching the number of available training
prototypes.

The second task consists of classifying human banded chromosomes represented as strings.
The data used for this task was extracted from a database of approximately 7,000
chromosomes that were classified by cytogenetic experts. Each digitized chromosome image
was preprocessed through a procedure that starts by obtaining an idealized,
one-dimensional density profile that emphasizes the band pattern along the chromosome. The
idealized profile is then mapped nonlinearly into a string composed of symbols over a
certain alphabet. A total of 4400 samples was collected, 200 samples of each of the 22
non-sex chromosome types. The standard procedure for estimating the error rate of a
classifier applied to this task is a 2-fold cross-validation in which both sets are chosen
to have 100 samples of each of the 22 non-sex chromosome types. Following this procedure,
we have recently obtained an excellent error rate of 4.9% by application of the 12-nearest
neighbours decision rule based on a time-consuming normalized edit distance.

Three techniques have been compared in both tasks: a) random selection alone; b) random
selection followed by the k-medians algorithm; and c) the k-medians algorithm initialized
by the randomly started maxmin method. Supervised selection and the greedy-interchange
combination have not been considered because of their high cost. For the first task, the
three techniques considered have been executed 50 times for each
$k \in \{10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000\}$. The same set of 10,000
training prototypes and the Euclidean distance were always used in these executions. The
average error rate of the nearest neighbour classifier based on the representatives
selected is shown in the left panel of figure 5 as a function of k. For the second task,
values of $k \in \{22, 44, 88, 220, 440, 880, 2200\}$ have been tried. For each technique
and each of these values, an estimate of the error rate associated with the
k-representatives-based 12-nearest neighbours classifier was obtained by averaging over
the estimates computed from 5 executions of the 2-fold cross-validation procedure
discussed above. Results are shown in the right panel of figure 5.





Fig. 5. Results provided by three techniques for unsupervised selection of prototypes
tested on two classification tasks involving synthetic (left) and real (right) data. The
average error rate of the k'-nearest neighbour classifier (left: k' = 1; right: k' = 12)
is shown as a function of the number of representatives selected (k).



From the results of figure 5 it is clear that random selection is significantly improved
by using the k-medians algorithm, and this is particularly true when the maxmin algorithm
is used as initialization technique. For instance, in the second task, random selection
leads to an error rate of 7.6% for k = 880, while the maxmin initialization reduces this
figure to 5.8%. Although these results are quite satisfactory, they are not as good as one
would expect. It would be nicer to see error rate curves ending with large, nearly flat
shapes but, in contrast, they keep decaying markedly up to the largest values of k tried.
Obviously, larger sets of training prototypes would compensate for the slow rate of
convergence shown by the (k'-)nearest neighbour(s) decision rule. On the other hand, this
slow rate of convergence might be accelerated through optimization of the standard
decision rule from the data.

5 Conclusions 

Four initialization techniques for the k-medians clustering algorithm have been compared:
random selection, supervised selection, the greedy-interchange algorithm and the maxmin
algorithm. The capabilities of these techniques have been assessed through experiments in
two typical applications of clustering; namely, exploratory data analysis and unsupervised
prototype selection. Results clearly show the importance of a good initialization of the
k-medians algorithm in all the cases. Random initialization too often leads to bad final
partitions, while best results are generally obtained using supervised selection. The
greedy-interchange and the maxmin algorithms generally lead to partitions of high quality,
without the manual effort of supervised selection. Of these two algorithms, the latter is
generally preferred because of its better computational behaviour.

Abstract. We present a very simple unsupervised vector quantizer which extracts higher
order concepts from time series generated from sensors on a mobile robot as it moves
through an environment. The vector quantizer is constructive, i.e. it adds new model
vectors, each one encoding a separate higher order concept, to account for any novel
situation the robot encounters. The number of higher order concepts is determined
dynamically, depending on the complexity of the sensed environment, without the need of
any user intervention. We show how the vector quantizer elegantly handles many of the
problems faced by an existing architecture by Nolfi and Tani, and note some directions for
future work.



1 Introduction 

As a mobile robot moves through an environment, it receives a sequence of inputs 
through its sensory equipment, this sequence of inputs is called the ‘sensory flow’. 
The sensory flow can easily be in the order of thousands, or even millions, of 
discrete samples. Finding relations and reoccurring phenomena in this sequence 
is a computationally intractable task, especially when we receive noisy or even 
faulty inputs which need to be Altered out. However, instead of working on 
the sensory input sequence directly, abstractions can be formed which capture 
the general characteristics of the inputs instead of each individual input. For 
instance, when the robot is moving down a corridor, it receives basically the 
same type of inputs time step after time step; a wall to the left and a wall to the 
right. As the robot encounters a fork in the corridor, the inputs change radically: 
suddenly the front sensors may become active and the left and right sensors no 
longer sense walls. These distinct changes in the sensory inputs can be exploited 
by the robot. If the task of the robot requires it to remember the path it just 
took through a maze it could, instead of storing each individual input it received, 
requiring extensive memory capabilities, store an abstraction of the inputs, e.g. 
‘corridor, left turn, room, corridor’, requiring only a fraction of storage space. 
It could later use this stored abstraction to navigate through the maze, or to 
output a description to the user of what the robot has perceived. Similarly, if the 
task involves finding relations between events that are far apart in time, an error- 
signal will vanish as it passes through thousands and thousands of intermediate 
sequence elements, making it practically impossible to find such relations. This 
problem becomes more manageable if the input sequence is segmented into a 
sequence of higher order concepts, each represented by a unique symbol. Finding 
relations between these symbols is a much more viable task than working directly 
on the input sequence. 
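
As a minimal sketch of this abstraction step, assume each time step has already been given a label (the labels and segment lengths below are invented for illustration). Collapsing runs of equal labels yields the short symbol sequence; a second helper keeps the run lengths, which the plain symbol sequence discards.

```python
from itertools import groupby


def to_symbol_sequence(labels):
    """Collapse consecutive repeats of per-time-step labels into one symbol each."""
    return [label for label, _ in groupby(labels)]


def to_runs(labels):
    """Same collapse, but keep the length of each segment as well."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]


# Hypothetical per-time-step labels for a short trip through a maze.
flow_labels = ["corridor"] * 4 + ["left turn"] * 2 + ["room"] * 5 + ["corridor"] * 3

symbols = to_symbol_sequence(flow_labels)
runs = to_runs(flow_labels)
```

Here fourteen labelled time steps reduce to four symbols; storing `runs` instead of `symbols` is one way to retain the duration of each segment if a task needs it.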

Finding recurring sub-sequences is a quite complex task 
if there are no clearly defined boundaries in the input sequence. However, 
the sequence used in this paper has easily identifiable and stable regions with 
fairly sharp transition borders, similar to the sequence shown in Figure 1. It 
is generated from the sensors of a mobile wall-following Khepera robot, which 
moves through a simulated environment. 




jumps between regions of fixed signal mean with added noise; the task is to find the dif- 
ferent signal levels and to detect the transitions between them. Labelling each segment 
with a symbol, the entire sequence can be stored (with some loss of information) using 
just seven symbols instead of hundreds, or even thousands, of distinct input values. 
The length of each sub-sequence is however lost during the mapping 



Nolfi and Tani |7IH| conducted experiments using a similar wall-following 
robot and segmented the sensory flow using a hierarchical neural network archi- 
tecture consisting of several prediction and segmentation networks. While their 
system managed to extract higher order concepts from the sensory flow, such as 
‘walls’, ‘corners’ and ‘corridors’, it had problems finding sub-sequences which did 
not occur very often, yet were quite distinct. Furthermore, they needed to ma- 
nually specify exactly how many higher order concepts the system should split 
the sequence into instead of letting the system decide this on its own, depending 
on the complexity of the sequence. 

In the following, a simple constructive vector quantizer which solves these 
problems efficiently is designed. It finds an ‘appropriate’ segmentation using 
just a single presentation of the input sequence, unlike Nolfi and Tani’s methods, 
which require repeated presentations of the input sequence. Higher order 
concepts which do not occur very frequently, but are distinct, are successfully 
extracted by the system. Furthermore, the system determines the number of 
higher order concepts automatically, based on the sequence characteristics. 

In section 2, Nolfi and Tani’s original experiments are summarized, and a 
number of problems with their system are highlighted. In section 3, we describe 
the resource allocating vector quantizer, which solves these problems. It is defined 
mathematically and results of the experiments conducted with this architecture 
are presented. Finally, in section 4, the main advantages of the new approach 
are summarized. 



2 Existing Methods 

In experiments carried out by Nolfi and Tani PE], different neural network 
architectures were investigated which segmented sensory input sequences from 
mobile robots. In the former paper |3, they designed a modular system of gated 
experts, where each module represented a sub-sequence. In the latter paper [7] 
they presented an altered architecture which involved a simpler, yet still quite 
complex, hierarchical architecture which is described below. 



2.1 Architecture 

The hierarchical architecture Nolfi and Tani [7] used consisted of a first level 
input prediction network, a segmentation network, and a second level 
sub-sequence prediction network. 

The first level prediction network was a recurrent network with 10 input 
units (encoding the 8 sensor values and 2 motor values at time t), 3 hidden 
units, and 8 output units (encoding the expected 8 sensor values at time t + 1). 
The activation of the hidden units at the previous time step was fed into 3 
additional input units at the succeeding time step, providing a memory of previous 
events which could help in predicting the next inputs. 
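
The data flow of such a network can be sketched as follows. Only the layer sizes (10 inputs, 3 context units, 3 hidden units, 8 outputs) come from the description above; the sigmoid activation, the absence of bias terms, the random weights and the random inputs are assumptions made purely for illustration, and no training is shown.

```python
import math
import random

N_IN, N_HID, N_OUT = 10, 3, 8  # layer sizes taken from the text


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_weights(rows, cols, rng):
    # Small random weights; the initialization scheme is an assumption.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def step(x, context, w_hid, w_out):
    """One time step: (10 inputs + 3 context units) -> 3 hidden -> 8 outputs."""
    full_input = list(x) + list(context)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, full_input))) for row in w_hid]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return output, hidden  # the hidden activation becomes the next context


rng = random.Random(0)
w_hid = init_weights(N_HID, N_IN + N_HID, rng)
w_out = init_weights(N_OUT, N_HID, rng)

context = [0.0] * N_HID
for t in range(5):
    x = [rng.random() for _ in range(N_IN)]  # stand-in for sensor and motor values
    prediction, context = step(x, context, w_hid, w_out)
```

The three values returned as `context` are also what a segmentation stage working on hidden activations would receive as input.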

The activation of the hidden units of the first level prediction network con- 
stituted the input to the segmentation network, which thus had 3 input units. 
These input units were connected to a pre-defined number of ‘winner take all’ 
output units which each represented a different higher order concept. Nolfi and 
Tani used 3 such output units. They argued that a segmentation based on the 
hidden unit activation of a prediction network, instead of using the input se- 
quence directly, allowed enhancement of less frequent sub-sequences. 

The segmentation network was updated in an unsupervised manner, similar 
to a Self-Organizing Map 0 with neighbourhood range set to zero (i.e. no neig- 
hbours were updated). There was also a second level prediction network which 
tried to find regularities in the sequence of extracted higher order concepts. This 
particular aspect is however not relevant here, but the reader is encouraged to 
read the original paper for more details about this. 



2.2 Experiments 

The environment consisted of two simulated rooms of different sizes, connected 
together by a short corridor. The robot, a simulated Khepera robot, was control- 
led by a fixed wall-following behaviour which was not affected by the prediction 
and segmentation networks. The networks were merely idle observers, trying to 
find regularities in the sequence of inputs. Nolfi and Tani’s experiments are here 
replicated, using Olivier Michel’s publicly available Khepera Simulator 0, and 
the resulting segmentation is depicted in Figure 2. 




Fig. 2. The simulated environment and the segmentation acquired using the Nolfi and 
Tani approach. Each unit in the segmentation network has been assigned a different 
shade; the shade of the winning unit at each time step is shown. The simulated Khepera 
robot is shown to the right; it has eight distance sensors and two motors. Distance sensor 
values are in the range [0,1023] and motor values are in the range [-10,10]. 



As noted by Nolfi and Tani, the extracted sub-sequences can be described as 
‘walls’ (light gray), ‘corridors’ (gray) and ‘corners’ (black). 



2.3 Problems 

The architecture used in Nolfi and Tani’s experiments required many repeated 
presentations of the same input sequence (Nolfi and Tani used over 300 laps in 
the environment) in order to extract the sub-sequences, which made it computa- 
tionally intractable to repeat the experiments with different parameter sets. This 
is a problem since they had many user-specified parameters which all influenced 
the outcome of the segmentation, e.g. number of hidden units in the prediction 
nets, choice of learning rates, weight initialization, decay value, etc., the choice 
of which could lead to very different segmentations. 

Moreover, the training had to be split into different phases, one for training 
the first level prediction network, another for training the segmentation network. 
The duration of these phases also needed to be decided by the user. 

Further, the system could not detect situations which had low density, i.e. 
situations that did not occur very often or did not last for a long period of time, 
for example very short corridors. Even more severely, the system could suffer from 
catastrophic forgetting if new situations arose which led to a reallocation of the 
hidden activation space of the first layer prediction network, thus making the existing 
segmentation layer weights inappropriate or even invalid. 

Finally, the user was forced to manually specify the number of categories 
which should be extracted, instead of letting the system determine this on its 
own as the robot negotiated the world. 

The system proposed in the following deals with all of the problems identified 
above. It has the ability to segment the input sequence using a single presentation, 
without having to split the training into different phases. It also operates 
directly on the input sequence, avoiding the risk of catastrophic forgetting. Fi- 
nally, the system determines the number of higher order concepts automatically 
depending on the complexity of the input sequence. It is also able to handle 
situations where new concepts are introduced dynamically. 



3 The Resource Allocating Vector Quantizer 

The resource allocating vector quantizer (RAVQ) represents higher order con- 
cepts using model vectors. Additional model vectors are allocated dynamically 
when new and stable situations are encountered. 



3.1 Related Work 

The process of using model vectors to represent categories has been successfully 
employed in a number of different architectures, e.g. in Kohonen’s Learning 
Vector Quantization and Self-Organizing Maps 0. Such networks however rely 
on a fixed number of model vectors. Careful analysis of the complexity of the 
input signal has to be performed by the user in order to specify an appropriate 
number of units. Using too few units will force the system to disregard some 
of the possible categories, while too many may create unwanted, or spurious, 
categories. 

This problem has been alleviated through the development of constructive 
systems, e.g. systems by Platt 0, Fritzke |3|, Chan and Vetterli P), i.e. systems 
which are able to allocate further resources whenever deemed necessary. While 
a variety of such systems are employed today, the family of adaptive resonance 
theory (ART) networks by Carpenter and Grossberg P are most relevant to the 
design of the RAVQ. 

The ART networks classify inputs into categories, where each category has 
its own prototype. The prototype depicts the typical input pattern which is as- 
sociated with the category. When new input patterns are encountered which 
do not closely match any of the existing categories, further categories might be 
created. This corresponds to the allocation of an additional output unit and a 
corresponding prototype. The new prototype is initialized to match the present, 
unfamiliar, input pattern. The ART networks do, however, have problems coping 
with noisy input patterns, which can result in the incorporation of spurious 
categories (jS] page 138). Many such noisy inputs can however be filtered using 
a simple method employed by the RAVQ, which stores a number of previous 
inputs to provide a context for the current input pattern. 



3.2 Theory 

The RAVQ is specifically constructed for sequentially ordered inputs where each 
model vector comes to represent a different sub-sequence. The model vectors 
can be viewed as points in the input space, not as sequences or paths. This 
implies that only sub-sequences with stable signal values can be represented 
accurately using such a model vector. This means that the RAVQ is limited to 
input sequences where the signal mean basically remains fixed for a period of 
time with some occasional transition to a new signal mean. As is shown here, 
this is adequate for the segmentation of a sequence generated from the sensors 
of a mobile Khepera robot. 

The RAVQ has only three user specified parameters: a window size n, a 
mismatch reduction requirement δ, and a stability criterion ε. The system forms 
a set of model vectors, each placed approximately at the centre signal value 
of the sub-sequence it is meant to represent. Initially empty, the set of model 
vectors is extended as soon as novel, stable situations are encountered. This is 
done as follows: 

A moving average of the last n inputs is calculated in order to filter out 
noise in the input signal. This moving average is incorporated as a new model 
vector (we denote the moving average as a ‘model vector’ as soon as it has 
been incorporated) if it characterizes a stable mismatch reducing situation, i.e. 
it meets both the mismatch reduction criterion and the stability criterion: 

— the reduction criterion is that the moving average alone should account for 
the inputs better than the existing model vectors do, reducing the mismatch 
by at least δ, 

— the stability criterion is that the deviation between the inputs and the moving 
average should stay below a certain threshold ε; otherwise the situation 
is not characterized as being stable (the robot may be switching between 
existing model vectors or experiencing a temporary sensor fluctuation). 

Neither of the above criteria is sufficient on its own. Having only a reduction 
criterion can lead to the incorporation of model vectors for erratic inputs 
(i.e. suffering from the same problems as the ART networks). Having only a stability 
criterion, on the other hand, can lead to the incorporation of new model 
vectors which are virtually identical to already incorporated model vectors. 



3.3 Definition 

The key part of the RAVQ is an input buffer of size n. At each time step this 
buffer stores the last n input vectors x(t) ∈ X. In the first n time steps (t = 
0, ..., n - 1), inputs are simply recorded into the buffer and the RAVQ is not 
activated until time step t = n - 1, when the input buffer has been filled. At this 
time, the set M(t) of model vectors m_j is initialized to the empty set: 

$$ M(n-1) = \emptyset , \qquad (1) $$

and at each successive time step t, the finite moving average x̄(t) is calculated: 

$$ \bar{x}(t) = \frac{1}{n} \sum_{i=0}^{n-1} x(t-i) . \qquad (2) $$

We define a distance metric d(V, X) which specifies the mean of the shortest 
distances between a set of (model / moving average) vectors v_j ∈ V and a set of 
input vectors x_i ∈ X, i.e. the average error for the inputs given a set of vectors: 

$$ d(V, X) = \frac{1}{|X|} \sum_{i=1}^{|X|} \min_{1 \le j \le |V|} \{ \| x_i - v_j \| \} ; \quad x_i \in X ,\ v_j \in V , \qquad (3) $$

where ||.|| denotes the Euclidean distance. That is, for each given input vector it 
finds the best match among the vectors in V and averages the resulting distances. 
This distance metric can be used to calculate the mean distance d_x̄(t) from each 
of the last n inputs to the moving average x̄(t): 

$$ d_{\bar{x}}(t) = d(\{\bar{x}(t)\}, \{x(t), \ldots, x(t-n+1)\}) . \qquad (4) $$
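
A small worked example may help to make the metric concrete. The sketch below is an illustration only: the function names and the toy buffer are invented, and `math.dist` (Python 3.8 or later) is used for the Euclidean distance.

```python
import math


def mean_min_distance(vectors, inputs):
    """d(V, X): mean over the inputs of the distance to the closest vector in V."""
    if not vectors or not inputs:
        raise ValueError("both sets must be non-empty")
    return sum(min(math.dist(x, v) for v in vectors) for x in inputs) / len(inputs)


def moving_average(inputs):
    """Component-wise mean of the buffered inputs (Eq. 2)."""
    n = len(inputs)
    return [sum(component) / n for component in zip(*inputs)]


buffer = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]  # toy buffer with n = 3
x_bar = moving_average(buffer)
d_xbar = mean_min_distance([x_bar], buffer)    # Eq. 4: spread around the moving average
d_model = mean_min_distance([(0.0, 0.0), (1.0, 3.0)], buffer)  # two hypothetical model vectors
```

For this buffer the moving average is (1, 1); the distances to it are √2, √2 and 2, so `d_xbar` is (2√2 + 2)/3. With the two hypothetical model vectors, two inputs are matched exactly and the third lies at distance 2, so `d_model` is 2/3.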

The same distance metric is used to calculate the distance d_M(t) between 
each of the last n inputs and the best matching model vector at each time step: 

$$ d_M(t) = \begin{cases} d(M(t), \{x(t), \ldots, x(t-n+1)\}) & |M(t)| > 0 \\ \varepsilon + \delta & \text{otherwise} . \end{cases} \qquad (5) $$

If there are no model vectors in M(t), i.e. the RAVQ has just started, the 
distance d_M(t) is set to be sufficient for incorporation of this moving average 
into the set of model vectors. For the moving average to be incorporated as 
a model vector, the mean distance between the last n inputs and the moving 
average x̄(t) must stay below the threshold ε, i.e. it must constitute a stable 
input location, and the improvement which is possible through an incorporation 
of the current moving average into the set of model vectors must exceed the 
minimum improvement requirement δ: 

$$ M(t+1) = \begin{cases} M(t) \cup \{\bar{x}(t)\} & d_{\bar{x}}(t) < \min(\varepsilon ,\ d_M(t) - \delta) \\ M(t) & \text{otherwise} . \end{cases} \qquad (6) $$



The higher order concept the system is in at time t is indicated by the 
selection of the best matching model vector win(t) in relation to the moving 
average x̄(t) at that time step: 

$$ \mathrm{win}(t) = \arg\min_{1 \le j \le |M(t)|} \| \bar{x}(t) - m_j \| ; \quad m_j \in M(t) . \qquad (7) $$

That is, win(t) specifies the index of the best matching model vector at time 
step t. 



3.4 Experiments 

Replicating the segmentation part of the Nolfi and Tani [7] experiments in 
Olivier Michel’s publicly available Khepera Simulator |S|, but this time using 
the RAVQ, the same segmentation was achieved in only a fraction of the time 
(Figure 3). The system operates directly on the input sequence. The input vector 
to the RAVQ had 10 elements, 8 distance sensor values and 2 motor values, all 
normalized to the range [0.0, 1.0]. The parameters used were n = 10, δ = 0.9 and 
ε = 0.2. 
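
The definition in Section 3.3 can be sketched in a few lines of code. The following is an illustrative reading of Eqs. 1-7, not the authors' implementation. It assumes that the distance for an empty model set is ε + δ, the smallest value that lets the incorporation test reduce to the stability test, and it runs on an invented one-dimensional signal with toy parameter values (n = 10, δ = 0.1, ε = 0.03) rather than on Khepera sensor data with the parameters given above.

```python
import math
import random
from collections import deque


class RAVQ:
    """Sketch of the resource allocating vector quantizer (Eqs. 1-7)."""

    def __init__(self, n, delta, epsilon):
        self.n = n
        self.delta = delta
        self.epsilon = epsilon
        self.buffer = deque(maxlen=n)  # the last n input vectors
        self.models = []               # M(t), initially empty (Eq. 1)

    @staticmethod
    def mean_min_distance(vectors, inputs):
        """d(V, X) of Eq. 3: mean distance from each input to its closest vector."""
        return sum(min(math.dist(x, v) for v in vectors) for x in inputs) / len(inputs)

    def step(self, x):
        """Present one input; return win(t) as an index into M(t), or None."""
        self.buffer.append(tuple(x))
        if len(self.buffer) < self.n:
            return None  # buffer not yet filled
        inputs = list(self.buffer)
        x_bar = tuple(sum(c) / self.n for c in zip(*inputs))      # Eq. 2
        d_xbar = self.mean_min_distance([x_bar], inputs)          # Eq. 4
        if self.models:                                           # Eq. 5
            d_models = self.mean_min_distance(self.models, inputs)
            winner = min(range(len(self.models)),
                         key=lambda j: math.dist(x_bar, self.models[j]))  # Eq. 7
        else:
            d_models = self.epsilon + self.delta  # assumed value for an empty M(t)
            winner = None
        if d_xbar < min(self.epsilon, d_models - self.delta):     # Eq. 6
            self.models.append(x_bar)
        return winner


# Toy 1-D signal: four stable segments with small uniform noise (hypothetical values).
rng = random.Random(1)
levels = [0.1, 0.8, 0.1, 0.5]
signal = [(level + rng.uniform(-0.01, 0.01),) for level in levels for _ in range(60)]

ravq = RAVQ(n=10, delta=0.1, epsilon=0.03)
winners = [ravq.step(x) for x in signal]
model_values = [round(m[0], 1) for m in ravq.models]
```

On this toy signal the sketch allocates one model vector per distinct level (three in total), and the return to the 0.1 level reuses the first model vector instead of creating a new one. Because win(t) is evaluated against M(t) before the update, a new segment is reported by its own model vector only from the time step after that vector has been incorporated.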




Fig. 3. The resulting segmentation directly from the first lap (about 1,900 time steps) 
in the environment. There is a slight delay of n inputs before the first corridor and 
corner segments are detected. The winning model vector win{t) at each time step is 
indicated using different shades 



The system is also capable of instantaneous learning of new, previously un- 
encountered, situations at any time of the simulation since there is no decaying 
learning rate or similar parameter which would make such learning harder at 
later points in the simulation. For instance, if a turning corridor is added (something 
which was not present in Nolfi and Tani’s original experiments) the 
RAVQ first tries to handle the situation using the existing model vectors; the 
best match is the ‘corner’ model vector. But this model vector does not exactly 
match the situation since there now is a wall on the left side, so the RAVQ swiftly 
adds a new model vector for ‘turning corridor’ (Figure 4a), and the next time 
this situation occurs, this model vector is used accordingly (Figure 4b). (There 
is no sensor pointing back to the sides; this is why the system produces a ‘wall’ 
segment just before the ‘corridor’ / ‘turning corridor’, since the inputs depict a 
wall to the right but nothing to the left.) 

The architecture used by Nolfi and Tani [7] is incapable of detecting new 
situations; there is no mechanism for adding more units, representing sub-sequences, 
when such units are needed. Further, starting out with ‘extra’ units would 
not help either since they would soon be allocated to better cover the more 
dense input regions, and would subsequently be hard to reallocate to handle 
new situations instead. 



Fig. 4. A new model vector is incorporated as soon as a new situation is detected, for 
instance a corridor which turns (a). The uptake region of the new model vector steals 
space from the existing model vectors; the ‘corner’ model vector is no longer activated 
the next time this particular situation is encountered (b) 



4 Conclusions 

We have described a resource allocating vector quantizer (RAVQ) which is capa- 
ble of single presentation segmentation of the sensory flow of a mobile robot. 
Compared to the earlier models of Nolfi and Tani [7], the RAVQ approach to 
segmentation is considerably simpler, requiring no division of the training into 
separate phases, and with just three user specified parameters: a window size 
n, a mismatch reduction requirement δ and a stability criterion ε. The RAVQ 
elegantly handles low density inputs, since it places model vectors 
according to novelty and stability rather than input frequency. Further, and more 
importantly, the RAVQ dynamically determines the appropriate number of categories 
to use, without the need for any user intervention. Finally, since the 
RAVQ works directly on the input sequence, there is no risk of catastrophic forgetting 
due to a reallocation of the hidden space, which could be very damaging 
in previously used methods for this task. 

The RAVQ is limited to sequences where the signal mean basically remains 
fixed for a period of time with some occasional transition to a new signal mean. 
The cause of this limitation is that each higher order concept is represented 
using a single model vector, placed directly in input space. If instead regions or 
paths could be identified in the input space, and represented using model vectors, 
sequences of inputs could form higher order concepts. This could, for instance, 
correspond to situations where a mobile robot travels through a broadening 
corridor or moves diagonally through a room. Future extensions should also 
involve the incorporation of adaptation of the model vectors to better account 
for the higher order concepts which they represent. This could be performed 
using a learning rule similar to that employed in Self-Organizing Maps 0. 

The extracted sequence of higher order concepts can be viewed as an abstract 
representation of the environment, created through the eyes of the robot. The 
granularity of the segmentation can be controlled indirectly through the mis- 
match reduction and the stability criteria. As different levels of segmentations 
may be needed for different tasks, a series of robot tasks should be designed, 
testing how useful the higher order concepts actually are in tackling memory-intensive 
tasks such as path learning through mazes and finding long time dependencies 
in the sensory-motor flow. 

Abstract. In pattern recognition systems, Chow’s rule is commonly used to 
reach a trade-off between error and reject probabilities. In this paper, we 
investigate the effects of estimate errors affecting the a posteriori probabilities 
on the optimality of Chow’s rule. We show that the optimal error-reject trade- 
off is not provided by Chow’s rule if the a posteriori probabilities are affected 
by errors. The use of multiple reject thresholds related to the data classes is then 
proposed. The authors have proved in another work that the reject rule based on 
such thresholds provides a better error-reject trade-off than in Chow’s rule. 
Reported results on the classification of multisensor remote-sensing images 
point out the advantages of the proposed reject rule. 



1 Introduction 



In statistical pattern recognition, the probability that a given pattern, characterized by 
a feature vector x, belongs to the i-th class, in an N-class problem, is provided by the a 
posteriori probability P(ω_i | x) through the Bayes formula: 

$$ P(\omega_i \mid x) = \frac{p(x \mid \omega_i) P(\omega_i)}{p(x)} , \quad i = 1, \ldots, N , \qquad (1) $$



where p(x | ω_i) is the conditional probability density function for x in the i-th class, 
P(ω_i) is the a priori probability of occurrence of the i-th class, and p(x) is the 
probability density function for x: 

$$ p(x) = \sum_{i=1}^{N} p(x \mid \omega_i) P(\omega_i) . \qquad (2) $$
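
As a minimal numerical sketch of Eqs. 1 and 2, consider a one-dimensional problem with two classes. The Gaussian class-conditional densities, their parameters and the equal priors are assumptions chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate normal density, used here as the class-conditional p(x | w_i)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posteriors(x, priors, params):
    """Return [P(w_1 | x), ..., P(w_N | x)] computed with Eqs. 1 and 2."""
    joint = [p * gaussian_pdf(x, m, s) for p, (m, s) in zip(priors, params)]
    evidence = sum(joint)  # p(x), Eq. 2
    return [j / evidence for j in joint]


priors = [0.5, 0.5]
params = [(-1.0, 1.0), (1.0, 1.0)]  # hypothetical (mean, std) for each class

post_mid = posteriors(0.0, priors, params)    # halfway between the two means
post_right = posteriors(2.0, priors, params)  # well inside the second class
```

Halfway between the two means both posteriors are 0.5, the point of maximum ambiguity, while at x = 2 the second class has posterior 1 / (1 + e^(-4)), roughly 0.98.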

A classification algorithm aims to subdivide the feature space into N decision 
regions D_i, i = 1, ..., N, so that the patterns of the class ω_i belong to the region D_i. 
According to the Bayes theory, the decision regions are defined to maximize the 
following probability of correct recognition, commonly named “accuracy” of the 
classifier: 

$$ \text{Accuracy} = P(\text{correct}) = \sum_{i=1}^{N} \int_{D_i} p(x \mid \omega_i) P(\omega_i) \, dx . \qquad (3) $$



' Corresponding author. Phone: +39-070-6755874 Fax: +39-070-6755900 

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 863-871, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 




864 G. Fumera, F. Roli, and G. Giacinto 



To this end, each pattern x must be assigned to the class for which P(ω_i | x) is 
maximum. This is the so-called Bayes decision rule. The classifier that maximizes the 
above correct classification probability is named “optimal Bayes classifier”. By 
analogy with equation 3, it is easy to see that the classifier error probability can be 
computed as follows: 

$$ P(\text{err}) = \sum_{i=1}^{N} \int_{D_i} \sum_{j \ne i} p(x \mid \omega_j) P(\omega_j) \, dx , \qquad (4) $$

where P(correct) + P(err) = 1. The minimum of the above error probability is 
reached by the Bayes rule and is named the Bayes error. 

Theoretically speaking, an error probability lower than the Bayes error can be obtained 
using the so-called “reject option”. Namely, the patterns that are the most likely to be 
wrongly classified are “rejected”, that is, they are not classified. Typically, they are 
then handled by more sophisticated procedures (e.g., a manual classification process 
is performed). In real applications, the aim of the reject option is to safeguard against 
excessive errors in order to obtain the accuracy required by the end-user of the pattern 
recognition system. However, handling high reject rates is usually too time-consuming 
for application purposes. In addition, correct classifications may also be 
converted into rejects as the rejection rate increases. Therefore, a trade-off between 
error and reject is mandatory. The formulation of the best error-reject trade-off and 
the related optimal reject rule was given by Chow [1]. According to Chow’s rule, a 
pattern x is rejected if the maximum of the a posteriori probabilities is lower than a 
given threshold value T ∈ [0, 1]: 

$$ \max_{k=1,\ldots,N} P(\omega_k \mid x) = P(\omega_i \mid x) < T . \qquad (5) $$

On the other hand, the pattern x is “accepted” and assigned to the class ω_i if: 

$$ \max_{k=1,\ldots,N} P(\omega_k \mid x) = P(\omega_i \mid x) \ge T . \qquad (6) $$
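
The two conditions of Chow's rule amount to a single comparison on the largest posterior, as in the following minimal sketch. The function name, the use of `None` to signal a reject, the zero-based class indices and the example posteriors are choices made for this illustration.

```python
REJECT = None


def chow_decision(posteriors, threshold):
    """Return the index of the most probable class, or REJECT if its posterior is below T."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold T must lie in [0, 1]")
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return best if posteriors[best] >= threshold else REJECT


confident = chow_decision([0.05, 0.90, 0.05], threshold=0.7)   # accepted as class index 1
ambiguous = chow_decision([0.40, 0.35, 0.25], threshold=0.7)   # rejected
borderline = chow_decision([0.30, 0.70], threshold=0.7)        # accepted: 0.70 is not below T
```

Raising T rejects more patterns; with T = 0 nothing is rejected and the rule reduces to the Bayes decision rule.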

The rationale of Chow’s reject rule becomes evident if one observes that 
max_i P(ω_i | x) is the conditional probability of classifying a given pattern x correctly. 
Therefore, for a given threshold T and the related reject rate, the patterns with the 
highest probabilities to be wrongly classified are rejected. A detailed proof of the 
optimality of Chow’s rule can be found in [4]. It is worth noting that Chow proved that 
his decision rule provides the optimal error-reject trade-off under the 
assumption that the a posteriori probabilities are exactly known [1]. 

It is easy to see that a classifier using the reject option subdivides the feature space into 
N+1 decision regions D_1, ..., D_N, D_0, such that patterns belonging to D_0 are rejected, and 
patterns belonging to D_i are assigned to the class ω_i. The reject region D_0 is 
determined according to equation 5. Equation 6 is used for defining the decision 
regions D_1, ..., D_N. With the reject option it makes sense to distinguish between 
rejected and accepted patterns. It is then useful to define the reject and acceptance 
probabilities. The probability that a pattern is rejected is computed as follows: 

$$ P(\text{reject}) = \int_{D_0} p(x) \, dx . \qquad (7) $$

On the other hand, the probability that a pattern is accepted is: 

$$ P(\text{accept}) = 1 - P(\text{reject}) = \int_{D_1 \cup \cdots \cup D_N} p(x) \, dx = \sum_{i=1}^{N} \int_{D_i} \sum_{j=1}^{N} p(x \mid \omega_j) P(\omega_j) \, dx . \qquad (8) $$



It is worth noting that only the accepted patterns are classified. Therefore, 
P(correct) + P(err) ≤ 1. It is also easy to see that P(accept) = P(correct) + P(err), and 
P(correct) + P(err) + P(reject) = 1. 

For classifiers using the reject option the accuracy is defined as the conditional 
probability that a pattern is correctly classified given that it has been accepted: 

$$ \text{Accuracy} = P(\text{correct} \mid \text{accept}) = \frac{P(\text{correct}, \text{accept})}{P(\text{accept})} . $$

Finally, according to equation 8 and taking into account that 
P(correct, accept) = P(correct) (i.e., only the accepted patterns are correctly or wrongly 
classified), we can write the following equation: 

$$ \text{Accuracy} = P(\text{correct} \mid \text{accept}) = \frac{P(\text{correct})}{P(\text{correct}) + P(\text{err})} . \qquad (9) $$
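
In practice these probabilities are estimated from counts on a test set. The sketch below computes them for invented counts, only to show how Eq. 9 differs from the plain fraction of correctly classified patterns.

```python
def reject_option_rates(n_correct, n_error, n_rejected):
    """Estimate P(correct), P(err), P(reject) and the accuracy of Eq. 9 from counts."""
    total = n_correct + n_error + n_rejected
    if total == 0:
        raise ValueError("at least one pattern is required")
    accepted = n_correct + n_error
    return {
        "p_correct": n_correct / total,
        "p_err": n_error / total,
        "p_reject": n_rejected / total,
        # Accuracy is conditional on acceptance; undefined if everything is rejected.
        "accuracy": n_correct / accepted if accepted else float("nan"),
    }


# Hypothetical test set of 1000 patterns.
rates = reject_option_rates(n_correct=850, n_error=50, n_rejected=100)
```

With these counts P(correct) is 0.85 while the accuracy is 850/900, about 0.944, because the 100 rejected patterns are excluded from the denominator.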



As previously pointed out, Chow’s reject rule provides the optimal trade-off 
between error and reject only if the a posteriori probabilities of the data classes are 
exactly known. However, in real applications, such an assumption is not satisfied since 
the available a posteriori probabilities are affected by estimate errors. Therefore, 
approaches different from Chow’s rule have been proposed to handle the error-reject 
trade-off [2,3]. However, to the best of our knowledge, no work has theoretically 
addressed the problem of the optimal error-reject trade-off when a posteriori 
probabilities are affected by errors. In particular, the reject rules proposed in the 
literature were not theoretically compared with Chow’s. 

In this paper, we investigate the effects of estimate errors affecting the a posteriori 
probabilities on the optimality of Chow’s rule (Section 2). We show that the optimal 
error-reject trade-off is not provided by Chow’s rule when the a posteriori 
probabilities are affected by errors. In Section 3 the use of class-related reject 
thresholds is proposed. The authors have proved in [6] that the reject rule based on 
such thresholds provides a better error-reject trade-off than Chow’s rule. Section 4 
reports results on the classification of multisensor remote-sensing images that point 
out the advantages of the proposed reject rule. Conclusions are drawn in Section 5. 



2 Reject Option with Class-related Thresholds 

As previously stated, Chow’s reject rule provides the optimal trade-off between error 
and reject only if the posterior probabilities of the data classes are exactly known. 
This fact can be illustrated by an example. Figure 1 shows a simple one-dimensional 
classification task with two data classes ω_1 and ω_2 characterized by Gaussian 
distributions. 








Fig. 1. A one-dimensional classification task with two data classes ω_1 and ω_2 characterized by 
Gaussian distributions. The application of Chow’s rule with reject threshold T to the “true” and 
“estimated” a posteriori probabilities is shown. 

The terms P(ω_i | x) and P̂(ω_i | x), i = 1, 2, indicate the “true” and “estimated” a 
posteriori probabilities, respectively. We hypothesized that estimate errors are 
negligible when the two classes are “well separated”, that is, when the difference 
between the two a posteriori probabilities is large. In contrast, significant errors affect 
the estimated probabilities in the range of feature values where the two classes 
“overlap”. Other researchers share this assumption, which is in agreement with 
real experiments [5]. The optimal decision and reject regions provided by Chow’s rule 
applied to the true probabilities are indicated by the terms D_1, D_2 and D_0. The term T 
indicates the reject threshold used in Chow’s rule. Analogously, the terms D̂_1, D̂_2, 
and D̂_0 stand for the decision and reject regions provided by Chow’s rule applied to 
the estimated probabilities. It is easy to see that Chow’s rule applied to the estimated 
probabilities never provides the optimal decision and reject regions D_1, D_2 and D_0. No 
value of the threshold T yields these regions. Therefore, the example in 
Figure 1 shows that Chow’s rule cannot provide the optimal error-reject trade-off 
when the a posteriori probabilities are affected by errors. The authors proved the 
general validity of this conclusion. For the sake of brevity, the reader interested in 
the proof is referred to [6]. 

However, a careful analysis of Figme 1 suggests a different approach from Chow’s 
rule for obtaining the optimal error-reject trade-off, even if the a posteriori 
probabilities are affected by errors. First of all, we can observe that the estimated 
decision regions Dj and D2 differ from the optimal ones in the ranges (Di - D^) and 
(D2 - 62) . Accordingly, non-optimal decisions are taken within these ranges by 
Chow’s rule applied to the estimated probabilities. In particular, the patterns 
belonging to the range (D^ - D^) are erroneously accepted, since the a posteriori 
probability P{a>i \ x) takes higher values than the true ones within this range. 
However, it is easy to see that such patterns would be correctly rejected using a 
threshold value T 1 higher than T. Analogously, the patterns belonging to the range 
(D2 - D2) are erroneously rejected, since the a posteriori probability P{co2 \ x) takes 
lower values than the true ones within this range. Such patterns would be correctly 
accepted using a threshold value lower than T. 
The above analysis suggests the use of multiple reject thresholds to obtain the
optimal error-reject trade-off, even if the a posteriori probabilities are affected by
errors. In particular, different thresholds for the different data classes should be used.
Figure 2 shows the use of two different reject thresholds $T_1$ and $T_2$ for the
classification task described in Figure 1.




Fig. 2. Two different reject thresholds $T_1$ and $T_2$ are applied to the estimated class-posterior
probabilities of the classification task in Figure 1. Such thresholds yield the optimal
reject region corresponding to Chow’s rule applied to the true class-posterior probabilities.

It is easy to see that such thresholds, applied to the estimated probabilities, yield
the optimal reject region corresponding to the single-threshold Chow’s rule
applied to the true probabilities. It is worth remarking that Chow’s rule applied to the
estimated probabilities is not able to provide this optimal reject region. Therefore,
under the assumption that the a posteriori probabilities are affected by errors, the use
of multiple thresholds can provide a better error-reject trade-off than Chow’s rule.

The general validity of the above conclusion has been proved in [6]. In particular,
under the assumption that the a posteriori probabilities are affected by significant
errors, we have proved that, for any reject rate $R$, values of the thresholds $T_1, \ldots, T_N$
exist such that the corresponding classifier’s accuracy $A(T_1, \ldots, T_N)$ is equal to or
higher than the accuracy $A(T)$ provided by Chow’s rule.

Therefore, we propose the following reject rule for a classification task with $N$ data
classes that are characterized by “estimated” posterior probabilities $P(\omega_i \mid x)$,
$i = 1, \ldots, N$. A pattern $x$ is rejected if

$$\max_{k=1,\ldots,N} P(\omega_k \mid x) = P(\omega_i \mid x) < T_i , \qquad (10)$$

while $x$ is accepted and assigned to the class $\omega_i$ if

$$\max_{k=1,\ldots,N} P(\omega_k \mid x) = P(\omega_i \mid x) \geq T_i . \qquad (11)$$

The above thresholds $T_1, \ldots, T_N$ are named “class-related reject thresholds” (CRTs),
and take on values in the range [0, 1]. Accordingly, the proposed rule is named the CRT
rule. It is worth noting that, analogously to Chow’s rule, in real applications the
values of the CRTs have to be estimated according to the classification task at hand.
In the next section, we describe the basic concepts of an algorithm devoted to estimating
such values. For the sake of brevity, we refer the reader interested in more details
about this algorithm to [6].
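
The decision rule of equations (10) and (11) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `crt_decide` and `chow_decide` are hypothetical and do not come from the paper, and the posteriors are assumed to be given as a plain sequence of estimated values, one per class.

```python
from typing import Optional, Sequence


def crt_decide(posteriors: Sequence[float], thresholds: Sequence[float]) -> Optional[int]:
    """Apply the CRT rule; return the winning class index, or None on rejection."""
    if len(posteriors) != len(thresholds):
        raise ValueError("one threshold per class is required")
    # Index i of the class with the largest estimated posterior.
    winner = max(range(len(posteriors)), key=lambda k: posteriors[k])
    # Equation (10): reject when the winning posterior is below its own threshold T_i.
    if posteriors[winner] < thresholds[winner]:
        return None
    # Equation (11): otherwise accept and assign the pattern to class i.
    return winner


def chow_decide(posteriors: Sequence[float], threshold: float) -> Optional[int]:
    """Chow's rule is the special case in which all class thresholds are equal."""
    return crt_decide(posteriors, [threshold] * len(posteriors))
```

With thresholds (0.7, 0.5), a pattern with posteriors (0.6, 0.4) is rejected, whereas a pattern with posteriors (0.4, 0.6) is accepted and assigned to the second class: the same winning posterior value of 0.6 leads to different decisions depending on the winning class.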



3 An Algorithm for Estimating Class-related Reject Thresholds 

In [6] the authors have proved that the following proposition is true:

$$\forall R \;\; \exists\, T_1, T_2, \ldots, T_N : \; A(T_1, T_2, \ldots, T_N) \geq A(T) . \qquad (12)$$

Namely, we have proved that, for any given reject rate $R$ and the corresponding
Chow’s threshold $T$, values of the CRT thresholds exist such that the accuracy
provided by the CRT rule is equal to or higher than that of Chow’s rule. It is easy to see that
such CRT values can be estimated by evaluating the maximum of the function
$A(T_1, \ldots, T_N)$ for a given reject rate $R$. Accordingly, the CRT values that satisfy
equation 12 are estimated by solving the following maximization problem:

$$\begin{cases} \max_{T_1, \ldots, T_N} A(T_1, \ldots, T_N) \\ R(T_1, \ldots, T_N) \leq R_{MAX} \end{cases} \qquad (13)$$

It is worth noting that the inequality constraint in the above equation is meant to
take into account the error-reject requirements of real pattern recognition applications.
The end-user of a pattern recognition system usually wishes to obtain the highest
classification accuracy and a reject rate below a fixed threshold $R_{MAX}$.

According to the CRT rule, the accuracy and the reject probabilities $A(T_1, \ldots, T_N)$
and $R(T_1, \ldots, T_N)$ are functions of the CRTs. For given values of the CRTs, such
probabilities can be estimated according to equations 7 and 9 using a validation set.
Since the functions $A(T_1, \ldots, T_N)$ and $R(T_1, \ldots, T_N)$ are computed using a finite data set,
they take on a finite number of values in the range [0,1]. Therefore, equation 13
corresponds to a constrained maximization problem, where the “target” and the
“constraint” functions $A(T_1, \ldots, T_N)$ and $R(T_1, \ldots, T_N)$ are discrete-valued functions of
continuous variables. Unfortunately, to the best of our knowledge, no algorithm
reported in the literature fits well the characteristics of the above maximization problem.
Accordingly, we have developed a specially designed algorithm to solve it. First of
all, our algorithm takes into account that $R(T_1, \ldots, T_N)$ is an increasing function of the
variables $T_1, \ldots, T_N$, that is, the number of rejected patterns cannot decrease for
increasing values of the CRTs. In addition, we assume that $A(T_1, \ldots, T_N)$ is an
increasing function of $T_1, \ldots, T_N$. This assumption is often verified in the experiments.
According to this assumption, the basic idea of our algorithm is to solve equation 13
iteratively, starting from CRT values that provide a reject rate equal to zero (i.e.,
$T_i < 1/N$, $i = 1, \ldots, N$), and varying such values in order to increase the function
$A(T_1, \ldots, T_N)$. At each step, each threshold $T_i$ is increased according to the expression
$T_i + k \Delta t$, where $\Delta t$ is a positive constant and $k$ is an integer varying between 1 and
$k_{MAX}$. Then the variations of accuracy $\Delta A$ and reject $\Delta R$ due to such changes are
evaluated. The changes that provide the maximum positive value of $\Delta A / \Delta R$, and do
not make the reject rate exceed the threshold $R_{MAX}$, are selected to generate the next CRT
values. The algorithm stops when it is not possible to increase $A(T_1, \ldots, T_N)$ while
keeping $R(T_1, \ldots, T_N) \leq R_{MAX}$. It is worth noting that the proposed algorithm is not
guaranteed to find the optimal solution of equation 13. Nevertheless, experimental
results reported in the next section show that it affords CRT values that provide a
better error-reject trade-off than Chow’s rule.
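
The iterative search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (whose details are in [6]): accuracy is taken here as the fraction of accepted validation patterns that are correctly classified, since equations 7 and 9 are not reproduced in this section; only one threshold is changed per candidate move; and the names `accuracy_and_reject` and `estimate_crts` are hypothetical.

```python
import numpy as np


def accuracy_and_reject(post, labels, thresholds):
    """Accuracy on accepted patterns and reject rate for given CRTs (assumed definitions)."""
    winners = post.argmax(axis=1)
    top = post[np.arange(len(post)), winners]
    accepted = top >= thresholds[winners]          # CRT rule, equations (10)-(11)
    reject_rate = 1.0 - accepted.mean()
    if not accepted.any():
        return 0.0, reject_rate                    # convention when everything is rejected
    accuracy = (winners[accepted] == labels[accepted]).mean()
    return accuracy, reject_rate


def estimate_crts(post, labels, r_max, dt=0.001, k_max=200):
    """Greedy search: repeatedly apply the single-threshold increase with the best dA/dR."""
    n_classes = post.shape[1]
    thresholds = np.zeros(n_classes)               # zero thresholds give a zero reject rate
    acc, rej = accuracy_and_reject(post, labels, thresholds)
    while True:
        best = None
        for i in range(n_classes):
            for k in range(1, k_max + 1):
                cand = thresholds.copy()
                cand[i] += k * dt                  # candidate T_i + k * dt
                if cand[i] > 1.0:
                    break
                a, r = accuracy_and_reject(post, labels, cand)
                d_a, d_r = a - acc, r - rej
                # Keep moves that raise accuracy without violating R <= R_MAX.
                if d_a > 0 and d_r > 0 and r <= r_max:
                    ratio = d_a / d_r
                    if best is None or ratio > best[0]:
                        best = (ratio, cand, a, r)
        if best is None:                           # no admissible improvement: stop
            return thresholds, acc, rej
        _, thresholds, acc, rej = best
```

Because the accuracy strictly increases at every accepted move and can take only finitely many values on a finite validation set, the loop terminates. As in the text, the result is a heuristic solution of problem (13), not a guaranteed optimum.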



4 Experimental Results 

The data set used for our experiments consists of a set of multisensor remote-sensing
images related to an agricultural area near the village of Feltwell (UK). We selected
10944 pixels belonging to five agricultural classes (i.e., sugar beets, stubble, bare soil,
potatoes, carrots), and randomly subdivided them into a training set (5124 pixels) and
a test set (5820 pixels). Each pixel was characterized by a fifteen-element feature
vector containing the brightness values in the six optical bands and over the nine
radar channels considered. More details about the selected data set can be found in
[7,8].

Two different classifiers have been used in our experiments: a k-nearest neighbors
(k-nn) classifier and a multi-layer perceptron (MLP) neural network. For the k-nn
classifier, a value of the “k” parameter of twenty-one was used. The MLP network
had fifteen input units and five output units, matching the numbers of input features and data
classes, respectively. Fifteen hidden neurons were used.




Fig. 3. The accuracy-rejection trade-offs of the k-nn classifier using the CRT and Chow’s rules
are represented on the A-R plane for values of the rejection rate ranging from 0% to 20%.

Fig. 4. The accuracy-rejection trade-offs of the MLP neural network using the CRT and
Chow’s rules are represented on the A-R plane for values of the rejection rate ranging from 0%
to 20%.

According to the algorithm described in Section 3, test data were used to estimate
the values of the CRTs and of Chow’s reject threshold. Values of the $\Delta t$ and $k_{MAX}$
parameters equal to 0.001 and 200, respectively, were adopted. The CRT and Chow’s
rules were compared in the so-called accuracy-reject plane (A-R plane), introduced in
[4]. In the A-R plane, the accuracy-reject trade-offs provided by a given reject rule are
described by the curve A(R) connecting the points that represent the accuracy values
for different rejection rates. A range of reject rates from 0% to 20% was considered.
This range is usually the most significant for application purposes.

Figure 3 shows the accuracy-reject trade-offs provided by the k-nn classifier using
the CRT and Chow’s rules. The results are related to the test set and they are shown in
the A-R plane. It is worth noting that, for any value of the reject rate, the accuracy
provided by the CRT rule is higher than that of Chow’s rule. Accordingly, we can say that
the CRT reject rule provides a better error-reject trade-off than Chow’s rule. Figure
4 shows the results related to the MLP neural network. It is easy to see that
conclusions similar to those of the experiment with the k-nn classifier can be
drawn.



5 Conclusions 

In this paper, we addressed the problem of the optimality of Chow’s reject rule when
the a posteriori probabilities are affected by estimation errors. We showed that Chow’s
rule cannot provide the optimal error-reject trade-off if significant estimation errors are
present. We then proposed the use of class-related reject thresholds. The authors have
proved in [6] that the related reject rule provides a better error-reject trade-off than
Chow’s rule. Reported experimental results confirmed the proposed theory. Finally, it
is worth noting that the use of class-related reject thresholds was previously proposed
for different purposes by Yau and Manry [3]. They have shown that such multiple
thresholds make it possible to equalize the error and reject probabilities for different data classes.
Abstract. Clustering multivariate data that are contaminated by noise
is a complex issue, particularly in the framework of mixture model estimation,
because noisy data can significantly affect the parameter estimates.
This paper addresses this problem with respect to likelihood maximization
using the Expectation-Maximization algorithm. Two different
approaches are compared. The first one consists in defining mixture models
that take noise into account. The second one is based on robust
estimation of the model parameters in the maximization step of EM.
Both have been tested separately, then jointly. Finally, a hybrid model
is proposed. Results on artificial data are given and discussed.

Keywords: Clustering, Expectation-Maximization, Robustness, M-estimation



1 Introduction 

Clustering techniques are successfully applied in many areas and some software
can now be downloaded from the Internet, e.g. EMMIX by G. McLachlan et
al. and MCLUST by C. Fraley and A. Raftery. Clustering aims at describing
relationships between objects in order to group them into homogeneous clusters. Let
$X = \{x_1, x_2, \ldots, x_N\}$ be an observed p-dimensional random sample of size N. In
mixture model theory, each $x_k$ ($k = 1, \ldots, N$) is assumed to be a realization of
a p-dimensional random vector X with the C-component mixture probability
density function (pdf):

$$f(x; \Theta) = \sum_{i=1}^{C} \pi_i \, f(x; \theta_i) \qquad (1)$$

where $f(x; \theta_i)$ denotes the p-dimensional pdf of the $i$-th component and the pairs
$(\pi_i, \theta_i)$ ($i = 1, \ldots, C$) are the model parameters. The a priori probabilities $\pi_i$ sum up
to one. If a normal mixture model is assumed, $\theta_i = (\mu_i, \Sigma_i)^T$ with mean $\mu_i$
and covariance matrix $\Sigma_i$. Assuming independent features of X, the model
parameters $\Theta = (\pi_1, \ldots, \pi_C, \theta_1^T, \ldots, \theta_C^T)^T$ can be estimated by maximizing the
likelihood $\mathcal{L}(\Theta)$ using the Expectation-Maximization (EM) iterative algorithm
due to Dempster, Laird and Rubin:

$$\mathcal{L}(\Theta) = P(X \mid \Theta) = \prod_{k=1}^{N} \sum_{i=1}^{C} \pi_i \, f(x_k; \theta_i) \qquad (2)$$

The resulting clustering is not robust, i.e. it is too sensitive to outliers or
noise. In this paper, we address the problem of clustering noisy data. To face
such a problem, one can choose either to perform robust estimates or to use
more complex mixture models. In section 2, we briefly recall the EM algorithm.
Robust M-estimates that can be used are presented in section 3, where we focus
on the normal case. Next, we present a comparative study of both strategies on
artificial two-dimensional data.

F.J. Ferri et al. (Eds.): SSPR&SPR 2000, LNCS 1876, pp. 872-|^^ 2000.



2 EM Algorithm 

In clustering problems, observed data can be regarded as being incomplete data
because the labelling is unknown. Complete data $(x_k, z_k)$ can be defined
by introducing for every observation $x_k$ the realization $z_k = (z_{k1}, \ldots, z_{kC})$ of a
C-dimensional random variable Z representing the labels of $x_k$, i.e. $z_{ki}$ is equal to
1 when $x_k$ arises from the $i$-th component and 0 otherwise. Then, the maximization
of the likelihood (2) can be replaced by an easier one, namely the maximization of the complete
likelihood $\mathcal{L}_c(\Theta) = P(X, Z \mid \Theta)$. This is achieved by the EM
algorithm, which iteratively maximizes the expectation of the complete-data
log-likelihood:

$$Q(\Theta; \Theta^{(t)}) = E[\log \mathcal{L}_c(\Theta) \mid x, \Theta^{(t)}] \qquad (3)$$

where $(t)$ is an iteration index. Each EM iteration consists of two steps. Computation
of $Q(\Theta; \Theta^{(t)})$ corresponds to the so-called E-Step (Expectation Step).
Assuming independent $z_k$, the complete-data log-likelihood is:

$$\log(\mathcal{L}_c(\Theta)) = \sum_{k=1}^{N} \sum_{i=1}^{C} z_{ki} \log(\pi_i \, f(x_k; \theta_i)) \qquad (4)$$

Therefore:

$$Q(\Theta; \Theta^{(t)}) = \sum_{k=1}^{N} \sum_{i=1}^{C} E[z_{ki} \mid x, \Theta^{(t)}] \, \log(\pi_i \, f(x_k; \theta_i)) \qquad (5)$$

and the E-Step reduces to estimating $E[z_{ki} \mid x, \Theta^{(t)}]$. Let $\hat{z}_{ki}$ be this estimate.
The second step, the M-Step (Maximization Step) of EM, consists in finding the
value of $\Theta$ that maximizes $Q(\Theta; \Theta^{(t)})$:

$$\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta; \Theta^{(t)}) \qquad (6)$$

The EM algorithm monotonically increases the likelihood (see the work of C.F.J. Wu
for details).
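When a p-dimensional normal mixture model is assumed, the parameters of each
component as well as the prior probability are iteratively estimated using
$\hat{z}_{ki} = \pi_i f(x_k; \theta_i) / \sum_{j=1}^{C} \pi_j f(x_k; \theta_j)$, evaluated at the current parameter values:

$$\pi_i^{(t+1)} = \frac{\sum_{k=1}^{N} \hat{z}_{ki}}{N} \qquad (7)$$

$$\mu_i^{(t+1)} = \frac{\sum_{k=1}^{N} \hat{z}_{ki} \, x_k}{\sum_{k=1}^{N} \hat{z}_{ki}} \qquad (8)$$

$$\Sigma_i^{(t+1)} = \frac{\sum_{k=1}^{N} \hat{z}_{ki} \, (x_k - \mu_i^{(t+1)})(x_k - \mu_i^{(t+1)})^T}{\sum_{k=1}^{N} \hat{z}_{ki}} \qquad (9)$$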
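
One EM iteration for a normal mixture, i.e. the E-Step followed by updates (7) to (9), can be sketched with NumPy as below. This is an illustrative sketch rather than the authors' code; the function names `gaussian_pdf` and `em_step` are hypothetical, and no safeguard against singular covariance matrices is included.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """p-dimensional normal density evaluated at each row of x."""
    p = x.shape[1]
    diff = x - mean
    maha = np.einsum("ki,ij,kj->k", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2.0 * np.pi) ** p * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


def em_step(x, priors, means, covs):
    """One EM iteration; also returns the log-likelihood of the input parameters."""
    n, c = x.shape[0], len(priors)
    # E-Step: posterior label estimates z_hat[k, i].
    dens = np.column_stack(
        [priors[i] * gaussian_pdf(x, means[i], covs[i]) for i in range(c)]
    )
    z_hat = dens / dens.sum(axis=1, keepdims=True)
    # M-Step: equations (7), (8) and (9).
    weights = z_hat.sum(axis=0)
    new_priors = weights / n
    new_means = (z_hat.T @ x) / weights[:, None]
    new_covs = []
    for i in range(c):
        diff = x - new_means[i]
        new_covs.append((z_hat[:, i, None] * diff).T @ diff / weights[i])
    log_lik = np.log(dens.sum(axis=1)).sum()
    return new_priors, new_means, np.array(new_covs), log_lik
```

Calling `em_step` repeatedly and recording the returned log-likelihood gives a simple way to observe the monotone increase mentioned above.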



3 Robust Estimation 

Let $a = (a_1, \ldots, a_M)$ be a parameter to be estimated within a sample. Let $e_k$
be the difference between an observed $x_k$ and its predicted value $\hat{x}_k$, namely an
error. $e_k$ is a realization of a random variable $e$ whose probability distribution
is $\mathcal{J}$. Assuming independent samples, the likelihood to be maximized is a product.
The optimal $a$ can be obtained by minimizing the following cost function:

$$C(a) = \sum_{k=1}^{N} \rho\left(\frac{x_k - \hat{x}_k(a)}{\sigma_k}\right) \qquad (10)$$

where $\rho = \log(\mathcal{J}^{-1})$ and $\sigma_k$ is a weighting factor. This is achieved by solving the
differential equations:

$$\frac{\partial C(a)}{\partial a_j} = -\sum_{k=1}^{N} \frac{1}{\sigma_k} \, \psi\left(\frac{x_k - \hat{x}_k(a)}{\sigma_k}\right) \frac{\partial \hat{x}_k(a)}{\partial a_j} = 0 \qquad (11)$$

where $\psi(x) = \frac{d\rho}{dx}(x)$. Different M-estimate models are shown in Table 1, where
$w(x) = \psi(x)/x$ is a weight function increasing as the error decreases.



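Table 1. Different M-estimates

Model      $\rho(x)$                               $\psi(x)$                          $w(x)$
Legendre   $x^2$                                   $2x$                               $2$
Median     $|x|$                                   $\mathrm{sgn}(x)$                  $1/|x|$
Cauchy     $\frac{c^2}{2}\log(1 + (x/c)^2)$        $\frac{x}{1 + (x/c)^2}$            $\frac{1}{1 + (x/c)^2}$
Huber      $x^2/2$ if $|x| < c$,                   $x$ if $|x| < c$,                  $1$ if $|x| < c$,
           $c|x| - c^2/2$ else                     $c\,\mathrm{sgn}(x)$ else          $c/|x|$ else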
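
As a small illustration of the weight functions $w(x) = \psi(x)/x$ for the Cauchy and Huber models, the following sketch evaluates them with NumPy. The function names are hypothetical, and the Huber branch is written so that no division by zero occurs at $x = 0$.

```python
import numpy as np


def cauchy_weight(e, c):
    """Cauchy weight w(e) = 1 / (1 + (e/c)^2); decreases smoothly as |e| grows."""
    return 1.0 / (1.0 + (np.asarray(e, dtype=float) / c) ** 2)


def huber_weight(e, c):
    """Huber weight: 1 inside [-c, c], c/|e| outside."""
    a = np.abs(np.asarray(e, dtype=float))
    # np.maximum keeps the denominator >= c, so the unused branch never divides by zero.
    return np.where(a < c, 1.0, c / np.maximum(a, c))
```

In both cases the parameter c sets the error scale beyond which observations start to be down-weighted.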
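Such models can be used when estimating the parameters of the components
of a normal mixture by EM. Equations (8)-(9) are simply replaced by the
following procedure:

- $\tau = 0$, $\tilde{\mu}_i^{(t+1,\tau)} = \mu_i^{(t+1)}$, $\tilde{\Sigma}_i^{(t+1,\tau)} = \Sigma_i^{(t+1)}$

- repeat

  $\forall k = 1, \ldots, N$: $e_k = (x_k - \tilde{\mu}_i^{(t+1,\tau)})^T \, (\tilde{\Sigma}_i^{(t+1,\tau)})^{-1} \, (x_k - \tilde{\mu}_i^{(t+1,\tau)})$, $\quad w(e_k) = \frac{\psi(e_k)}{e_k}$

  $$\tilde{\mu}_i^{(t+1,\tau+1)} = \frac{\sum_{k=1}^{N} w(e_k) \, \hat{z}_{ki} \, x_k}{\sum_{k=1}^{N} w(e_k) \, \hat{z}_{ki}} \qquad (12)$$

  $$\tilde{\Sigma}_i^{(t+1,\tau+1)} = \frac{\sum_{k=1}^{N} w(e_k) \, \hat{z}_{ki} \, (x_k - \tilde{\mu}_i^{(t+1,\tau+1)})(x_k - \tilde{\mu}_i^{(t+1,\tau+1)})^T}{\sum_{k=1}^{N} w(e_k) \, \hat{z}_{ki}} \qquad (13)$$

  $\tau \leftarrow \tau + 1$

  until reached bound

It is worth noting that the property of monotonic increase of the likelihood
is lost by EM in the case of robust estimation. The model parameters are no longer
updated by solving $\partial Q / \partial \Theta = 0$. However, the estimates (8)-(9) are good
initial values for the robust ones (12)-(13).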

Since the weight functions $w$ are monotonically decreasing functions of the error
$e_k$, the larger $\tau$ is, the more robust but the less precise the estimates are.
Therefore the number of iterations is bounded, e.g.:

- $\tau \leq \tau_{max}$

- $\frac{\| \tilde{\mu}_i^{(t+1,\tau+1)} - \tilde{\mu}_i^{(t+1,\tau)} \|}{\| \tilde{\mu}_i^{(t+1,\tau)} \|} < \epsilon$

- rate of samples having a nearly zero weight $\alpha < \alpha_{\tau}$

We have combined the first and the third conditions in our experiments.
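
The inner robust loop for a single component can be sketched as follows, using the Cauchy weight and combining an iteration bound with a bound on the rate of nearly-zero weights. This is a hedged sketch, not the authors' implementation: the function name `robust_component_update` and the default values of `c`, `tau_max`, `zero_weight` and `max_zero_rate` are assumptions chosen for illustration, and $e_k$ is taken as the quadratic form written in the procedure above.

```python
import numpy as np


def robust_component_update(x, z_hat, mean0, cov0, c=2.0, tau_max=20,
                            zero_weight=1e-3, max_zero_rate=0.5):
    """Iteratively re-weighted mean/covariance of one component, as in (12)-(13)."""
    mean, cov = np.array(mean0, dtype=float), np.array(cov0, dtype=float)
    for _ in range(tau_max):                        # first bound: tau <= tau_max
        diff = x - mean
        e = np.einsum("ki,ij,kj->k", diff, np.linalg.inv(cov), diff)
        w = 1.0 / (1.0 + (e / c) ** 2)              # Cauchy weight from Table 1
        # Third bound (assumed form): stop if too many samples get a near-zero weight.
        if np.mean(w < zero_weight) > max_zero_rate:
            break
        wz = w * z_hat
        mean = (wz[:, None] * x).sum(axis=0) / wz.sum()     # equation (12)
        diff = x - mean
        cov = (wz[:, None] * diff).T @ diff / wz.sum()      # equation (13)
    return mean, cov
```

Starting from the classical estimates (8)-(9), the loop pulls the mean away from distant points because their weights become small, at the price of a shrinking covariance estimate, which is the trade-off between robustness and precision noted in the text.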



4 Mixture Models 

In this study, we have used normal and uniform components for computational
convenience. Let $\mathcal{N}$ denote the p-dimensional Gaussian pdf and $\mathcal{U}$ the uniform
one defined on a given hypercube $H$. Here are the more or less complex mixture
models that we have tested:

1. C normal components where the parameters are estimated via (7)-(9):

$$f(x; \Theta) = \sum_{i=1}^{C} \pi_i \, \mathcal{N}(x; \mu_i, \Sigma_i) \qquad (14)$$

2. C normal and one uniform components, $\gamma$ being user-defined:

$$f(x; \Theta) = \gamma \, \mathcal{U}(x; H) + \sum_{i=1}^{C} \pi_i \, \mathcal{N}(x; \mu_i, \Sigma_i) \qquad (15)$$

3. C normal components where robust estimates (12)-(13) are used:

$$f(x; \Theta) = \sum_{i=1}^{C} \pi_i \, \mathcal{N}(x; \mu_i, \Sigma_i) \qquad (16)$$

4. C normal and one uniform components with robust estimates:

$$f(x; \Theta) = \gamma \, \mathcal{U}(x; H) + \sum_{i=1}^{C} \pi_i \, \mathcal{N}(x; \mu_i, \Sigma_i) \qquad (17)$$

5. C normal components with robust estimates and one additional mixture:

$$f(x; \Theta) = \sum_{i=1}^{C+1} \pi_i \, f(x; \theta_i) \qquad (18)$$

$$f(x; \theta_i) = (1 - \gamma_i) \, \mathcal{N}(x; \mu_i, \Sigma_i) \quad \forall i = 1, \ldots, C \qquad (19)$$

$$f(x; \theta_{C+1}) = \sum_{i=1}^{C} \gamma_i \, \mathcal{N}(x; \mu_i, \alpha_i \Sigma_i) \qquad (20)$$

In this latter model, which we propose, each of the C components is a linear combination
of two normal pdfs. The first one is intended to track cluster kernel points while
the second one is supposed to deal with surrounding outliers via multiplicative
coefficients $\alpha_i$. All these second modes are summed up to compose a $(C+1)$-th
component. The combination coefficients $\gamma_i$ are user-defined, as are the $\alpha_i$.

Our model differs from previous work, e.g. G. McLachlan and D. Peel or Y.
Kharin [7], in mixing robust estimation of the parameters in (19) and classical
estimation of those in (20).

5 Experiments



We have generated two data sets in order to test the robustness of the presented
mixture models. Both consist of three 2-dimensional Gaussian classes and
uniformly distributed samples supposed to be noisy points, as shown in Figure 1.
In the first set the classes are well separated while they strongly overlap in the second
one. Noisy patterns represent respectively 30% and 16.7% of each data set.
Table 2 summarizes the theoretical parameters of the classes for both generated
data sets. The correlation coefficients that describe the cluster orientations
are given. As with most partitioning methods for clustering data, the number of
mixture components has to be set. One can assess this value [2] but we preferred
to choose it manually (C = 3) in order to concentrate on the effect of robust
estimation. The fitted parameters provided by EM significantly depend on the initial
values. For each model and each data set, we have run EM 50 times with the same
random initial values $\mu_i$ and identity matrices for the covariance matrices. We also
have tried different values of the coefficients involved in the different models and
kept the best results according to the maximum posterior probability criterion
computed on the Gaussian points only. Only the Cauchy M-estimate has been tested. In
addition, the resulting cluster correlation coefficients are estimated.

Fig. 1. Data sets #1 (left) and #2 (right)



Not surprisingly, all the models give a similar low error rate value on data
set #1 (see Table 3). As the Gaussian classes are well separated and as the noise
rate is low enough (30%), the optimal error rate has been obtained for a low
value of the mixing coefficients $\gamma$ or $\gamma_i$ (models #2, #4 or #5). Therefore, robust
estimation has no significant effect. The Cauchy parameter $c$ whose value



Table 2. Data sets parameters 







C. Saint-Jean, C. Frelicot, and B. Vachon 



Table 3. Data set #1 - Results (final estimates values and errors)



Ml 

1^1 

ri 



Model #1 
( 7.07 \ 

/ 5.32 -1.69\ 
\^-1.69 2.28 ) 

-0.48 



/6.07 
\^2.74 
:.46 - 
- 0.01 
- 0.00 



Model #2 
/7.18' 
18.87, 

/ 2.49 -1.59 \ 
(^-1.59 1.56 ) 
-0.81 



\^2.73 ) 

( 4.06 0.08 \ 
1^0.08 1.04 j 



Model #3 

TtItV 
\^ 8.86 ) 

( 0.89 -0.63 \ 
\^-0.63 0.59 ) 

-0.87 



Model #4 

jTiiy 

\^8.84 ) 

( 1.72 -1.2\ 
1^-1. 2 1.17 ) 
-0.87 



/ 6.09 ' 

V2-73, 

3.31 0.00 ' 
^0.00 0.89, 

- 0.00 



Model #5 

\^ 8.86 ) 

( 2.77 -1.36\ 
(^-1.36 1.47 ) 

-0.67 



M2 

1^2 

T2 

M3 

i:'3 

1~3 

E 



( 4.46 -0.01 \ 
(^-0.01 1.13 ) 



6.21 
2.73, 

/ 1.77 -0.01 \ 
\^-0.01 0.41 j 

- 0.01 



6.09 
2.73, 

( 3.45 -0.00 \ 
\^-0.00 0.90 j 

- 0.00 



/ 2.77 \ 
\^6.90 j 

( 1.72 -0.38\ 
l^-0.38 1.26 ) 
-0.26 
3.43% 



( 2.98 \ 
\^7.00 ) 

( 2.05 -0.37\ 
\^-0.37 1.66 ) 



~f~2.99 ' 
^7.06, 

/ 1.03 -0.08 \ 
\^-0.08 0.76 J 
-0.09 



( 2.97 \ 
7.03 J 

( 1.78 -0.2 \ 
\^- 0.2 1.22 J 

-0.13 



3.43% 



/ 2.86 \ 
\^6.99 ) 

( 1.56 -0.31 \ 
\^-0.31 1.12 ) 



3.43% 



2.% 



2.86% 



describes the number of observed points contributing to the robust estimates
compensates for a low value of the mixing coefficients (if used). The smaller $\gamma$ or $\gamma_i$
are, the fewer noisy points are modelled, so the smaller Cauchy’s parameter $c$ must be in
order to filter enough. The means are very close to the theoretical ones whatever
the model is. On the other hand, taking the noise into account (models #2, #3, #4
and #5) clearly improves the cluster shapes and orientations as reflected by the
obtained covariance matrices and correlation coefficients. The optimal partition
we obtained with model #3 is shown in Figure 2 (left-hand side). Obviously,





Fig. 2. Data set #1 clustering with models #3 (left) and #5 (right)
all the outliers have been incorrectly clustered, the few ones in the upper right
corner in particular. This can be explained by the fact that original noisy points
do not enter into the error rate computation. Clusters resulting from the model
we propose (#5) are shown in Figure 2 (right-hand side).



Table 4. Data set #2 - Results (final estimates values and errors)



Model #1 



[ 3.97\ 
\^5.06 J 
( 5.60 0.03 \ 
1^0.03 4.54 ) 



Model #2 



Model #3 



/2.84\ ( 2.74 \ 

\^4.6SJ 

/ 1.45 -0.3S\ / 0.96 -0.37\ 

(^-0.38 3.14 ) \^-0.37 1.85 ) 



Model #4 



/ 3.02 \ 
\^5.20 ) 

/ 1.26 -0.10\ 
\^-0.10 3.38 ) 



Model #5 



7 ^^ 

\^-0.17 3.06 ) 



Ml 

ri 



/ 5.99 \ / 6.03 \ 

\^8.94j \^9.04j 

/ 1.70 0.95\ / 1.07 0.50 \ 

1^0.95 1.40 ) 1^0.50 0.81 ) 



/ 6.09 \ 
\^9.05 ) 

( 1.29 0.69\ 
1^0.69 1.07 ) 



M2 

^2 

T2 

M3 

^3 

E 



6.06 
9.13^ 

/ 2.02 1.19 \ 
1.19 1.41 J 

0.70 



6.10 
9.06^ 

/ 1.24 0.56 \ 
1^0.56 0.95 ) 



0.52 



/ 5.68 \ 
^5.17j 
/ 5.58 -5.4\ 
\^-5.4 5.53 ) 

-0.96 

15.33% 



( 5.61 \ ( 5.45 \ 

(^5.30 ) (^5.45 ) 

( 3.80 -3.37\ / 

\^-3.37 3.45 ) 



/6.40\ 
1 ^ 4.65 ) 

( 1.90 -1.82 \ 
(^-1.82 2.06 ) 

-0.92 

10.67% 



/6.30\ 

I4.74J 

/ 2.13 -2.01 \ 
\^-2.01 2.27 j 



-0.93 

12.67% 



3.00 -2.65 
-2.65 2.70 

-0.93 

14.22% 



11.33% 



When faced with much more overlapping clusters (data set #2), the parameter
estimation process is strongly disturbed, the means being attracted by the dense
areas, as shown in Table 4. As expected, robust estimates tend to correct this
trend, i.e. the obtained values are closer to the theoretical ones (models #3, #4
and #5). Consequently, the optimal Cauchy parameter values $c$ are smaller than
those selected on data set #1. Figure 3 shows the partitions that we have obtained
with the best models involving robust estimates according to both the parameter
fitting and the error (model #4 on the left-hand side, model #5 on the right-hand
side).



Table 5. Results over the 50 runs

              Error   Model #1   Model #2   Model #3   Model #4   Model #5
Data set #1   Mean    8.31%      9.87%      9.86%      11.87%     2.96%
              StDev   10.87      12.57      15.18      15.15      0.36
Data set #2   Mean    18.82%     13.27%     16.76%     13.95%     11.93%
              StDev   5.8        4.27       5.         6.97       0.2
In order to test the robustness to initial center locations, we have chosen the
optimal parameter setting with respect to the maximum posterior probability
criterion. Table 5 summarizes the means and standard deviations of the error
we have obtained over the 50 different runs. With a mean value close to
the minimum one and a low standard deviation, our model (#5) outperforms all
the others on both data sets. We think that the lower sensitivity of this model
to initialization can be explained by the introduction of normal subcomponents
that soften the tails of the resulting component. The ability of model #5 to
perform good clustering in spite of a bad initialization suggests that it would
be useful in many real situations.



6 Conclusion 

In this paper, we have compared two different approaches to clustering multivariate
data in the context of mixture likelihood maximization
with the EM algorithm. Indeed, this algorithm often fails to find accurate
parameters when the data are mixed with noisy data. So, one can either take
noisy data into account when defining the mixture model or use robust estimation
techniques. We have noticed that both approaches can improve the results
whatever the separability of the clusters is. Furthermore, in the case of strong
overlap, their joint use gives better results. We have proposed such a model, whose
performance in terms of misclassification as well as accuracy of the parameter
estimates is satisfactory. Moreover, we notice that this model is very robust to
different initializations. Further investigation will concern the automatic selection
of some coefficients involved in this model.





Fig. 3. Data set #2 clustering with models #4 (left) and #5 (right)
Abstract. The method of stochastic discrimination (SD) introduced by
Kleinberg is a new method in pattern recognition. It works by producing
weak classifiers and then combining them via the Central Limit
Theorem to form a strong classifier. SD is overtraining-resistant, has a
high convergence rate, and can work quite well in practice. However,
some strict assumptions involved in SD and the difficulties in understanding
SD have limited its practical use. In this paper, we present a simple
algorithm of SD for two-class pattern recognition. We illustrate the algorithm
by applications in classifying the feature vectors from some real
and simulated data sets. The experimental results show that SD is fast,
effective, and applicable.



1 Introduction 

Suppose that certain objects are to be classified as coming from one of two
classes, say class 1 and class 2. A fixed number of measurements made on each
object form a feature vector $q$. All the feature vectors constitute a finite feature
space $F \subset \mathbb{R}^p$. We can classify an object after observing its feature vector $q$ with
the aid of the classification rule of SD and a training set $TR = \{TR_1, TR_2\}$,
where $TR_i$ is a given random sample from class $i$. On an intuitive level, the idea
of SD is similar to how people learn: people learn new knowledge and strategies
step by step. After years of learning, the knowledge and strategies (or skills)
that they have accumulated will enable them to tackle complicated tasks. On a
precise mathematical level, the procedure of SD is outlined as follows.

Step 1. Use the training set and rectangular regions to produce $t$ weak classifiers
$S^{(1)}, S^{(2)}, \ldots, S^{(t)}$, where $t$ is a natural number. See Section 2.

Step 2. For any given feature vector $q$ from $F$, calculate the average

$$Y(q, S^t) = \frac{X(q, S^{(1)}) + X(q, S^{(2)}) + \cdots + X(q, S^{(t)})}{t} , \qquad (1)$$

where $X(\cdot, \cdot)$ is a base random variable defined later in Section 3.

Step 3. Set a level $t$ classification rule as follows: if $Y(q, S^t) > 1/2$, classify $q$
into class 1; otherwise classify $q$ into class 2. See Section 4.
The above procedure follows the idea in jjj. We will study these steps in
Sections 2-4.

SD is characterized by the properties of overtraining-resistance, a high
convergence rate, and a low misclassification error rate (see Pj and [Zj). This will
be shown by examples in Section 5. The underlying ideas behind SD were
introduced in jS|. Since then, a fair amount of research has been carried out on
this method, and on variations of its implementation. See, for example, P> P,
0. 0, 0 and P]. The results have convincingly shown that stochastic
discrimination is a promising area in pattern recognition.



2 How to Produce Weak Classifiers 

We produce weak classifiers through resampling. In fact, a weak classifier is a
finite union of rectangular regions which satisfies some coverage condition. In
this context, a rectangular region in $\mathbb{R}^p$ is a region of the form

$$\{ (x_1, x_2, \ldots, x_p) \mid a_i \leq x_i \leq b_i, \text{ for } i = 1, 2, \ldots, p \} , \qquad (2)$$

where $a_i$ and $b_i$ are real numbers for $i = 1, 2, \ldots, p$. Let $\mathcal{R}$ be an appropriate
rectangular region in $\mathbb{R}^p$ which contains $F$. We will utilize those rectangular
regions in $\mathcal{R}$ whose “width” $b_i - a_i$ along the $x_i$-axis is at least $\rho$ times the
corresponding width of $\mathcal{R}$, where $0 < \rho < 1$ is a fixed constant. The coverage
condition is related to a ratio variable $r$. For any subsets $T_1$ and $T_2$ of $F$, let
$r(T_1, T_2)$ denote the ratio of the number of common feature vectors in $T_1$ and $T_2$
to the number of feature vectors in $T_2$. For example, if $T_2$ contains 5 feature
vectors and $T_1$ and $T_2$ have 3 feature vectors in common, then $r(T_1, T_2) = 3/5 =
0.6$. It is seen that $r(T_1, T_2)$ represents the coverage of the points in $T_2$ by $T_1$.

Now we can define a weak classifier. Roughly speaking, a weak classifier is a
finite union of rectangular regions such that the coverage of the points in $TR_1$
by the union and the coverage of the points in $TR_2$ by the union are different.
Strictly speaking, let $\beta$ be a fixed real number with $0 < \beta < 1$; then $S$ is
said to be a weak classifier if $S$ is a union of at most $k$ rectangular regions in $\mathcal{R}$
which satisfies $|r(S, TR_1) - r(S, TR_2)| \geq \beta$. The condition $r(S, TR_1) \neq r(S, TR_2)$
simply states that $S$ can actually be used as a (very weak) classifier. To illustrate
this, consider an $S$ which contains 40 sample points from $TR_1$ and 60 from
$TR_2$. Then $r(S, TR_1) = 40/n_1$ and $r(S, TR_2) = 60/n_2$, where $n_i$ denotes the number
of points in $TR_i$. Suppose $40/n_1 > 60/n_2$.
That is, the coverage of the points in $TR_1$ by $S$ is greater than the coverage of
the points in $TR_2$ by $S$. Let $S^c$ denote the complement of $S$ in $F$. Then since
$1 - 40/n_1 < 1 - 60/n_2$, one sees that the coverage of the points in $TR_2$ by $S^c$ is
greater than the coverage of the points in $TR_1$ by $S^c$. Thus intuitively we could
use $S$ to classify $F$ by deciding that any sample point $q$ from $S$ belongs to class
1 and any other sample point $q$ from $S^c$ belongs to class 2. This of course gives
a (very) weak classifier.
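
The coverage ratio and the weak-classifier condition can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the paper: a region is represented as a list of (a_i, b_i) intervals, a candidate classifier as a list of such boxes, and candidates are drawn uniformly at random until the coverage condition holds. All function names are hypothetical.

```python
import random


def in_box(box, q):
    """True if point q lies in the rectangular region given by (a_i, b_i) pairs."""
    return all(a <= x <= b for (a, b), x in zip(box, q))


def in_union(boxes, q):
    """True if q lies in the union S of the given rectangular regions."""
    return any(in_box(box, q) for box in boxes)


def coverage(boxes, points):
    """r(S, T): fraction of the points of T covered by the union S."""
    return sum(1 for q in points if in_union(boxes, q)) / len(points)


def is_weak_classifier(boxes, tr1, tr2, beta):
    """Coverage condition |r(S, TR1) - r(S, TR2)| >= beta."""
    return abs(coverage(boxes, tr1) - coverage(boxes, tr2)) >= beta


def random_box(bounds, rho, rng):
    """Random box inside `bounds` whose width per axis is at least rho times the bound width."""
    box = []
    for lo, hi in bounds:
        width = rng.uniform(rho * (hi - lo), hi - lo)
        a = rng.uniform(lo, hi - width)
        box.append((a, a + width))
    return box


def sample_weak_classifier(bounds, tr1, tr2, rho, beta, k, rng, max_tries=10000):
    """Draw unions of at most k random boxes until one satisfies the coverage condition."""
    for _ in range(max_tries):
        boxes = [random_box(bounds, rho, rng) for _ in range(rng.randint(1, k))]
        if is_weak_classifier(boxes, tr1, tr2, beta):
            return boxes
    raise RuntimeError("no weak classifier found; try a smaller beta")
```

For instance, a single box covering 3 of 5 points of a set gives a coverage of 0.6, matching the numerical example in the text.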
3 Base Random Variables

To connect weak classifiers with feature vectors, we need a base random variable
$X(\cdot, \cdot)$. Given a feature vector $q$ and a weak classifier $S$, the value of $X$ is defined
to be

$$X(q, S) = \frac{1_S(q) - r(S, TR_2)}{r(S, TR_1) - r(S, TR_2)} , \qquad (3)$$

where $1_S(q) = 1$ if $q$ is contained in $S$ and 0 otherwise. $X(q, S)$ in (3) can be
understood as the standardized version of $1_S(q)$, which gives the simplest way
to connect the weak classifiers $S$ and feature vectors $q$.



4 Classification Rule 

Let $S^t = (S^{(1)}, \ldots, S^{(t)})$ be a random sample of t weak classifiers. For any 
$q \in F$, define $Y(q, S^t)$ as stated earlier. Under some mild conditions, the Central 
Limit Theorem can be used to show the following fact. If t is large enough then 
there is a high probability that $Y(q, S^t)$ is close to 1 for any q from $TR_1$ and 
close to 0 for any q from $TR_2$ (see Theorem 1 in [?]). Hence one can define the 
following 

Level t Stochastic Discriminant Classification Rule: For any $q \in F$, if 
$Y(q, S^t) > 1/2$, classify q into class 1, otherwise classify q into class 2. 
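
The base random variable and the level t rule can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: Y is taken to be the average of X over the t weak classifiers (the usual formulation of stochastic discrimination; the defining equation is not legible in this section), features are two-dimensional, and all function names are invented here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max)
WeakClassifier = List[Rect]               # a union of rectangles


def contains(S: WeakClassifier, q: Point) -> bool:
    return any(x0 <= q[0] <= x1 and y0 <= q[1] <= y1 for x0, x1, y0, y1 in S)


def coverage(S: WeakClassifier, points: Sequence[Point]) -> float:
    return sum(contains(S, q) for q in points) / len(points)


def base_variable(q: Point, S: WeakClassifier,
                  tr1: Sequence[Point], tr2: Sequence[Point]) -> float:
    """X(q, S) as in (3); r1 != r2 holds for any weak classifier."""
    r1, r2 = coverage(S, tr1), coverage(S, tr2)
    return ((1.0 if contains(S, q) else 0.0) - r2) / (r1 - r2)


def discriminant(q: Point, classifiers: Sequence[WeakClassifier],
                 tr1: Sequence[Point], tr2: Sequence[Point]) -> float:
    """Y(q, S^t), assumed here to be the mean of X over the t weak classifiers."""
    return sum(base_variable(q, S, tr1, tr2) for S in classifiers) / len(classifiers)


def classify(q: Point, classifiers: Sequence[WeakClassifier],
             tr1: Sequence[Point], tr2: Sequence[Point]) -> int:
    """Level t rule: class 1 if Y > 1/2, otherwise class 2."""
    return 1 if discriminant(q, classifiers, tr1, tr2) > 0.5 else 2


# Toy data, invented for illustration.
tr1 = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (0.8, 0.8)]
tr2 = [(0.9, 0.9), (0.7, 0.8), (0.8, 0.7), (0.2, 0.2)]
S_a: WeakClassifier = [(0.0, 0.5, 0.0, 0.5)]
S_b: WeakClassifier = [(0.5, 1.0, 0.5, 1.0)]
```

The standardization in (3) is what makes the rule work: averaged over $TR_1$ the variable X equals 1, and averaged over $TR_2$ it equals 0, whichever weak classifier is used.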

5 Experimental Analysis 

In this section, we report the experimental results on classifying feature vectors 
from several problems. The emphasis will be placed on normal populations. The 
comparison of SD with non-SD methods is also given. 

Example 1 (Two normal populations with equal covariance matrix). Consider 
two distributions $N(\mu_1, I)$ (class 1) and $N(\mu_2, I)$ (class 2), where I is the 
2x2 identity matrix, $\mu_1$ is the vector (1.5, 0)', and $\mu_2$ is the vector (0, 0)'. 
Both of the prior class probabilities $\pi_1$ (for class 1) and $\pi_2$ (for class 2) are 
equal to 1/2. The training set contains 400 points from each class, and the test set 
contains 1000 points from each class. Let $R_1$ be the smallest rectangular region 
which contains both training and test data. Suppose $\lambda \geq 1$. Let $R_\lambda$ denote the 
rectangular region similar to $R_1$ whose center is the same as that of $R_1$ and 
whose "width" along the $x_i$-axis is $\lambda$ times the corresponding width of $R_1$. We 
regard $R_\lambda$ as our R defined in Sect. 2. For the resampling process, $\lambda = 1$, p = 0.3, 
$\beta = 0.52$, and k = 5. The test error from SD is below 23.55% when the level 
t > 4700. See Fig. 1 for the performance of SD. From the figure, we see that 
both training and test errors decrease at the beginning and then quickly 
level off, forming two "parallel curves", as more weak classifiers are added. This 
phenomenon is common for SD classification procedures. As a comparison, the 
linear discriminant rule yields a test error of 23.15%. 

Fig. 1. Training and test errors for the classification with two normal distributions 
having the same covariance matrix. The training set contains 400 points from each 
class, and the test set contains 1000 points from each class. 
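
As a sanity check on the error rates quoted in Example 1, the theoretical Bayes error for two equal-prior normal populations with identity covariance is $\Phi(-\delta/2)$, where $\delta$ is the distance between the means. The short sketch below evaluates it for $\delta = 1.5$; the function names are illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error_identity_cov(mu1, mu2) -> float:
    """Bayes error for N(mu1, I) vs N(mu2, I) with equal priors."""
    delta = math.dist(mu1, mu2)  # Euclidean distance between the means
    return normal_cdf(-delta / 2.0)


error = bayes_error_identity_cov((1.5, 0.0), (0.0, 0.0))
print(f"{100 * error:.2f}%")  # about 22.66%
```

Both reported test errors (below 23.55% for SD and 23.15% for the linear rule) lie within about one percentage point of this lower bound.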



Example 2 (Classifying Alaskan and Canadian salmon). Here we will show the 
performance of SD on the salmon dataset from Johnson and Wichern [5]. The 
original data contain the information of gender, diameter of rings for the first- 
year freshwater growth, and diameter of rings for the first-year marine growth 
for 50 Alaskan salmon and 50 Canadian salmon. We treat both freshwater and 
marine growth ring diameters as the features of salmon. 

Johnson and Wichern [5] note that the data appear to satisfy the assumption 
of bivariate normal distributions, but the covariance matrices may differ. 
Thus the usual quadratic classification rule may be used to classify the salmon. 
Using the quadratic rule and equal prior probabilities, the error rate from a 
10-fold cross-validation is then 8%. 

To apply SD, we set $(\lambda, p, \beta, k) = (1.005, 0.1, 0.5, 5)$. Using the same data 
sets as those used with the quadratic rule, the 10-fold CV error rate from SD is 
virtually 9%. See Fig. 2 for the details. 

Other comparisons of SD with non-SD methods are also available. For example, 
[?] considers the classification for the Pima Indians diabetes dataset 
described in [?]. The test error from SD is actually identical to the best result 
in [?]. Section 3 of [3] reports one experiment on handwritten digit recognition 
and another experiment on classifying boundary types in DNA sequences. In the 
first experiment, SD is compared with a nearest neighbor algorithm, a k-nearest 
neighbor algorithm, and a neural network. In the second experiment, SD is compared 
with more than 20 different methods. In both cases the comparisons show that 
SD yields the best test set performance. 

Fig. 2. The averaged training and test errors for classifying Alaskan and Canadian 
salmon. The errors are based on a 10-fold cross-validation procedure. 

Notes. When applying SD to a dataset, we need the values of $\lambda$, p, k, and 
$\beta$. From Sect. 2, we know that these 4 parameters together determine weak 
classifiers. Since the quantitative relationship among these parameters is not 
available, we can apply SD to the training set alone to find the combination 
of these parameters for which the misclassification rate on TR is minimal. 
Thus, we propose the following two-step procedure. First, we run SD over TR 
by stepping through the range of these parameters and find the combinations 
corresponding to the best achieved TR performance. Since SD has an 
exponential convergence rate [2], this step is practical. In fact, we can usually 
obtain several satisfactory combinations, and we choose the one with which SD 
runs fastest. Then, we apply SD to both training and test sets with the selected 
parameters. 
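
The first step of this procedure amounts to a grid search followed by a tie-break on running time. A minimal sketch follows; the callable `evaluate`, assumed to run SD on the training set and return (training error, running time), is hypothetical, as are the grid keys and the tolerance `tol` used to decide which combinations count as satisfactory.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Params = Tuple[float, float, float, int]  # (lambda, p, beta, k)


def select_parameters(grid: Dict[str, Sequence],
                      evaluate: Callable[[Params], Tuple[float, float]],
                      tol: float = 0.0) -> Params:
    """Step 1: scan the grid, keep the combinations whose training error is
    within tol of the best one, and return the fastest of them."""
    results = []
    for params in product(grid["lam"], grid["p"], grid["beta"], grid["k"]):
        error, seconds = evaluate(params)
        results.append((params, error, seconds))
    best_error = min(error for _, error, _ in results)
    satisfactory = [(params, seconds) for params, error, seconds in results
                    if error <= best_error + tol]
    return min(satisfactory, key=lambda item: item[1])[0]


# Stand-in for a real SD run, invented only to make the sketch executable.
def fake_evaluate(params: Params) -> Tuple[float, float]:
    lam, p, beta, k = params
    return abs(beta - 0.5) + abs(p - 0.1), float(k)


grid = {"lam": [1.0, 1.005], "p": [0.1, 0.3], "beta": [0.5, 0.52], "k": [3, 5]}
chosen = select_parameters(grid, fake_evaluate)
```

The second step then reruns SD on both training and test sets with the returned combination.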

Acknowledgments. The first author is very grateful to his advisor Professor 
Abstract. A structure-adaptive approach to robust statistical estimation of 
image intensity for adaptive filtering and segmentation of images is described in 
the context of a two-region structural image model. The proposed adaptive 
estimation procedure selects, by the maximum a posteriori probability principle, 
the structuring region that best fits the current point from all available 
structuring regions. In application to image filtering, 
the described method suppresses noise without 
damaging the initial image, including corner edges and fine image details. It 
also provides a robust binary segmentation of local objects of interest and their 
edges on a noisy background. 



1 Introduction 



The main objective of image filtering is the suppression of different types of noise 
present in images as well as the enhancement of imaged objects, edges and other 
relevant details. The result of filtering is often used for reliable image segmentation, 
e.g. for detection of local objects of interest. A variety of filtering and intensity 
estimation techniques have been proposed for this purpose. However, the known 
filtering and segmentation techniques often 
estimate image intensity incorrectly, mostly because they ignore the shape constraints 
of objects of interest and the noise statistics, both of which can help to discriminate local objects 
or fine details from noisy pixels [1-4]. For example, the median filter does not blur 
edges, but it damages fine details including corner edges and small and thin isolated 
objects. Some adaptive methods solve this problem in an edge-preserving manner but 
they do not suppress noise at the edges and fine details in images. 

Known robust statistical estimators can be applied when a noise model with 
outliers is assumed, in order to achieve better noise removal. Such a noise model 
supposes that the noise magnitude follows one particular distribution law, e.g. a 
normal distribution, except for a small number of outliers with another distribution. 
Recently, several image filtering and segmentation methods based on 
statistical estimation of image intensity using the concept of robust regression have 
been proposed [4-6]. Although good results have been reported, the advantage of 
mixed noise removal in these methods does not contribute to the preservation of 
relevant local structures. 

To overcome the mentioned disadvantages of the known nonlinear methods, it is 
proposed to use a structure-adaptive approach to image filtering which consists of 
using multiple structuring regions for robust estimation of image intensity or other 
image local properties involving, first of all, pixels from the most homogeneous 
regions [4]. This approach is based on a model of image local structures which 
explicitly describes planar shape of local objects using notions and operations of 
mathematical morphology. Another important application of such an image modeling 
is the image segmentation based on a robust estimation of model parameters. 

The paper is organized as follows. After the problem statement in Section 1, the 
underlying model of the image intensity function is described in Section 2. 
Structure-adaptive image filtering and segmentation, using sub-sample selection by the 
maximum a posteriori probability principle for intensity estimation, are described in 
Section 3. Experimental results of filtering and binary segmentation are described in 
Section 4 and concluding remarks are given in Section 5. 



2 Image Structural Model Using Polynomial Regression 

2.1 Modeling of Homogeneous Regions on the Image Plane 

The so-called separate image modeling has been used in order to represent local 
objects in images and edge structures of present objects (homogeneous regions) as 
local geometrical structures [4]. It consists of two kinds of image modeling: a domain 
model of homogeneous regions and an image intensity model as a stochastic 
representation of intensity variations. The domain image modeling consists of a 
description of object support regions as sets of points on the image plane by means of 
generating sets and structuring elements [4]. The intensity function has constant 
parameters inside the support regions, which are created by the morphological 
operation of dilation of generating sets (lines), where a generating set is a set of 
four- or eight-connected image points whose width is equal to one. The 
generating set is a kind of object skeleton which, together with the structuring elements, 
determines the shape and size of objects of interest on the background. 
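
The construction of a support region by dilation can be illustrated with point sets. The sketch below assumes integer pixel coordinates and a structuring element given as a set of offsets from its central point; the names and the 3x3 element are chosen here for illustration.

```python
from typing import Set, Tuple

Point = Tuple[int, int]


def dilate(generating_set: Set[Point], structuring_element: Set[Point]) -> Set[Point]:
    """Morphological dilation: every generating point is replaced by a copy
    of the structuring element shifted to that point."""
    return {(i + di, j + dj)
            for (i, j) in generating_set
            for (di, dj) in structuring_element}


# A 3x3 square structuring element, as offsets from its central point.
square_3x3 = {(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)}

# A one-pixel-wide horizontal line as the generating set (object skeleton).
line = {(2, 1), (2, 2), (2, 3)}

support = dilate(line, square_3x3)  # a 3-row by 5-column support region
```

Here the three-point skeleton grows into a 15-point rectangular support region, so the element fixes the object's thickness while the generating set fixes its course.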

The defined structuring elements and generating sets are used to obtain the 
so-called structuring regions which are involved in structure-adaptive image filtering 
and segmentation [4]. The sample size N defines the statistically sufficient 
number of pixels for estimation of the model parameters and, ultimately, of the image 
intensity. For a structuring element, all the related structuring regions $\{V_l(i,j)\}$ that 
constitute a multiple local domain model are derived from it by selecting at least N 
nearest neighboring points relative to every point of the structuring element. In the 
case of multiple structuring elements, the same derivation procedure is applied 
to each of the structuring elements, followed by removal of redundant 
structuring regions, namely, the regions which include other smaller structuring 
regions. The structuring region which corresponds to the central point of a symmetric 
structuring element is called a symmetric structuring region; all other structuring 




890 R.M. Palenichka, P. Zinterhof, and I.B. Ivasenko 



regions represent edge structuring regions. Examples of structuring regions with 
various shapes are shown in Fig. 1. 



Fig. 1. Examples of two-region image fragments with different edge geometry, panels (a)-(c). 
Appropriate structuring regions are marked by the black quads. 



It can easily be proved that under the assumed domain model every 
image point belongs to at least one of the multiple structuring regions, since every 
domain point belongs to a certain structuring element shifted to a respective point on 
the image plane. Estimation of a pixel intensity by considering only edge pixels of an 
appropriate asymmetric neighborhood protects corner edge pixels from 
destruction, whereas median filtering, for example, destroys them. 



2.2 Polynomial Regression Model of Intensity Function 

A parametric function is defined for each region of the domain model in order to 
model the image intensity function so that the function parameters are constant within 
every single region in the image domain. The polynomial regression model has been 
adopted for a concise intensity description within homogeneous regions defined in the 
image domain model. The polynomial regression model states that the intensity 
function g(i,j), where the coordinates (i,j) are non-random explanatory 
variables (due to the exact discrete values of coordinates), can be represented as a 
polynomial function of order q within a neighborhood D(u,v) relative to a current 
point (u,v) plus an independent noise field n(i,j) with a Gaussian distribution N(0,1): 

$$g(i,j) = \sigma \cdot n(i,j) + \sum_{r+s \le q} \theta_{r,s}\, i^r j^s, \quad \forall (i,j) \in D(u,v), \qquad (2.1)$$

where D(u,v) might be a homogeneous region in the image plane or a structuring 
region as a subset of this homogeneous region including the current point (u,v), $\theta_{r,s}$ is the 
(r,s) regression coefficient, n(i,j) is white noise having zero mean and unit standard 
deviation, and $\sigma$ is the scale parameter for the model residuals (noise). The first term on the 
right-hand side of Eq. (2.1) is treated as the residuals of the polynomial regression model. The 
regression coefficients $\{\theta_{r,s}\}$ are considered together with the scale parameter $\sigma$ as the 
model parameter vector $\theta = [\theta_0, \ldots, \theta_m, \sigma]$ to be estimated, where $\theta_0$ is the surface 
intercept coefficient. 
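
A least-squares fit of model (2.1) over one region can be written without external libraries. The sketch below is illustrative only: it assumes coordinates measured relative to the current point, solves the normal equations by Gauss-Jordan elimination (adequate for the handful of coefficients involved, with no handling of singular designs), and estimates $\sigma$ as the root mean square residual. All names are invented here.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def monomials(q: int) -> List[Tuple[int, int]]:
    """Exponent pairs (r, s) with r + s <= q."""
    return [(r, s) for r in range(q + 1) for s in range(q + 1 - r)]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[k][n] / m[k][k] for k in range(n)]


def fit_polynomial(samples: Dict[Coord, float], q: int):
    """Return ({(r, s): theta_rs}, sigma) for the pixels in `samples`."""
    terms = monomials(q)
    rows = [[float(i ** r * j ** s) for r, s in terms] for (i, j) in samples]
    values = list(samples.values())
    n = len(terms)
    ata = [[sum(row[a] * row[b] for row in rows) for b in range(n)] for a in range(n)]
    atg = [sum(row[a] * g for row, g in zip(rows, values)) for a in range(n)]
    theta = solve(ata, atg)
    residuals = [g - sum(t * x for t, x in zip(theta, row))
                 for row, g in zip(rows, values)]
    sigma = (sum(e * e for e in residuals) / len(values)) ** 0.5
    return dict(zip(terms, theta)), sigma


# Noise-free plane g = 2 + 0.5 i - j sampled on a 3x3 window around (0, 0).
window = {(i, j): 2.0 + 0.5 * i - 1.0 * j for i in (-1, 0, 1) for j in (-1, 0, 1)}
coefficients, sigma = fit_polynomial(window, q=1)
```

For q = 1 the fit has three coefficients, the intercept and the two slopes, and on noise-free data it reproduces them with a residual scale of zero.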
The resulting image local model, after consideration of the introduced regression 
model, is a piecewise polynomial representation of image intensity within a local 
fragment whose size covers all possible positions of the structuring regions 
used. In some problem statements of image processing, especially in binary 
segmentation, a two-region local model with two assigned sets of regression 
coefficients can be assumed as a quite adequate representation of image intensity (see 
Fig. 2). For example, such a two-region model is explicitly supposed in image 
binarization by a thresholding operation. This local model of an image fragment 
represents the generalization of a step or ramp edge model with various planar shapes 
of edges. The intensity of a two-region image fragment is described by two vectors of 
regression parameters: $\theta = [\theta_0, \theta_1, \ldots, \theta_m, \sigma]$ and $\eta = [\eta_0, \eta_1, \ldots, \eta_m, \sigma]$. The case of equal 
polynomial coefficients except for the intercept coefficients represents the so-called 
conformable polynomial regression (CPR) model in which the difference $h = |\theta_0 - \eta_0|$ 
determines the local contrast of a two-region fragment. 



3 Algorithms for Structure-Adaptive Filtering and Binarization 

3.1 Robust Estimation of Model Parameters 

The method of least squares (LS) provides an optimal solution for estimation of the 
model parameter vector $\theta = [\theta_0, \ldots, \theta_m, \sigma]$ in the sense of the maximum likelihood (ML) 
principle when the model residuals $\sigma \cdot n(i,j)$ are normally distributed. However, 
not all the residuals in Eq. (2.1) have a Gaussian distribution in the neighborhood 
W(i,j), because edges and fine details may be present in this neighborhood according 
to the assumed model of intensity (Section 2.2). The pixels belonging to the other side of 
an edge present in W(i,j) are considered, for example, as outliers which can take 
arbitrarily large values within the gray scale. In this case, the methods of robust 
regression can overcome the deficiency of LS intensity estimation. Different 
approaches to robust parameter estimation have been developed, including the 
well-known M-estimators and L-estimators [7]. The known method of least median of 
squares (LMS) is applicable in the case of general regression and has a breakdown 
point of 50% [8]. The breakdown point of a regression estimation method is the 
smallest amount of outlier contamination that may force the value of the estimate 
outside an arbitrary range. 
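
The effect of a high breakdown point is easiest to see in the zero-order (location) case. The sketch below is an illustration, not one of the filters compared in the paper: it computes the LMS location estimate as the midpoint of the shortest interval containing just over half of the sorted values and compares it with the LS estimate (the mean) on a window in which one third of the pixels lie across an edge. The pixel values are invented.

```python
from typing import Sequence


def lms_location(values: Sequence[float]) -> float:
    """Least median of squares estimate of location: the midpoint of the
    shortest interval that contains h = n // 2 + 1 of the sorted values."""
    xs = sorted(values)
    n = len(xs)
    h = n // 2 + 1
    start = min(range(n - h + 1), key=lambda k: xs[k + h - 1] - xs[k])
    return 0.5 * (xs[start] + xs[start + h - 1])


# 3x3 window: six pixels near gray level 10, three outliers across an edge.
pixels = [10, 11, 9, 10, 12, 255, 255, 250, 10]

ls_estimate = sum(pixels) / len(pixels)  # dragged towards the outliers
lms_estimate = lms_location(pixels)      # stays with the majority
```

The mean lands above 90 gray levels although no pixel is anywhere near that value, whereas the LMS estimate stays at the level of the six inliers.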

The proposed approach to robust estimation is based on the maximum a posteriori 
probability (MAP) selection of the best partial estimate among all the partial 
estimates computed from different pixel sub-samples in W(i,j). Every partial estimate 
vector $\theta_k$ is computed by the maximum likelihood (ML) principle, which 
corresponds to the least squares method in the case of assumed Gaussian residuals 
(inlier points) in the polynomial regression model of Eq. (2.1). K distinct cases of 
outlier positions with respect to the inliers are supposed, with respective prior 
probabilities {P(k)} of these occurrences. If the prior probabilities {P(k)} are all 
known and correct, then the outlier occurrence $\mu$ for the final parameter estimate is 
selected by the MAP rule: 

$$\mu = \arg\max_{k} \{ P(k) \cdot P(\xi_k / k) \}, \qquad (3.1)$$

where $P(\xi_k / k)$ is the conditional probability of residuals for the sub-sample $\xi_k$ 
corresponding to the outlier occurrence k. In the case of impulsive noise, for example, 
the binomial distribution is suitable for {P(k)}. 

However, the direct application of the MAP rule for the best fitting sub-sample 
selection by Eq. (3.1) needs the marginal probability density functions $P(\xi_k / k)$ to 
be known for all possible values of the parameter vector to make such a selection. It 
requires multidimensional integration over all possible parameter vectors $\{\theta\}$. Since 
this is analytically unfeasible and computationally impractical, an asymptotic MAP 
criterion can be used in practice for selection of the best fitting sub-sample [9]. Another 
approach to handle this problem is to use K partial conditionally robust estimates $\{\theta_k\}$ 
of regression coefficients obtained from the sub-samples $\{\xi_k\}$. Such an approximate 
MAP criterion for model selection also requires robust initial estimates for the 
distribution parameters of inlier and outlier residuals [10]. 

In the proposed image model, the distribution of inlier residuals is supposed to be 
a Gaussian law $N(0; \sigma^2)$, whereas the distribution of the outlier magnitude is described 
by its own probability density function. For example, in the developed algorithms for 
image filtering, one-sided Gaussian and Laplacian probability density functions 
$G(A; \sigma)$ and $L(A; \sigma)$ (only one half of the shifted density function is used) were 
assumed for the distribution of the difference $(A - |r_{out}|)$, where $r_{out}$ is the outlier residual 
and $A > 0$ is the outlier range parameter. In the proposed CPR model, the natural 
assumption for the outlier distribution law is also a Gaussian distribution $N(h; \sigma)$, where h 
is the local contrast. The maximal value of $|r_{out}|$ can be adopted for the initial estimate 
of A with respect to the initial LS estimates of the coefficients, whereas the initial estimate 
for the variance $\sigma^2$ is made over the sub-sample of supposed inlier residuals which yields 
the minimal mean squared deviation. 

The robustness of the proposed estimator is characterized by a breakdown point of 
50%, i.e. the method yields a bounded estimate for the vector of regression 
parameters even in the case when (N-1)/2 points in the sample of N points are outliers 
with arbitrarily large amplitudes. The statistical efficiency of this robust estimate is 
determined by the total number of points in the sub-sample selected by Eq. (3.1). 



3.2 Application of Robust Intensity Estimation to Image Noise Filtering 

Image filtering can be considered as a robust estimation of model parameters 
within a window W(u,v) of N points and computation of the image intensity at point (u,v) 
as the value of the polynomial function at point (u,v) with the estimated coefficients. 
However, the estimation of polynomial coefficients described in Section 3.1 is not 
practical for noise filtering because the number of all possible sub-samples of 
pixels, equal to $2^N$, is computationally prohibitive. In contrast to known adaptive filtering 
methods, the proposed structure-adaptive approach yields an acceptable solution to 
this computationally complex problem by using intensity estimates over a restricted 
number L of structuring regions (Section 2.1) instead of all possible sub-samples. A 
pixel value is estimated over one appropriate structuring region from the L multiple 
structuring regions $\{V_l(i,j)\}$ in the underlying model. The number M of the most 
suitable structuring region for pixel estimation is determined by the Bayes rule: 

$$M = \arg\max_{l} \{ P(l) \cdot P(x / l) \}, \qquad (3.2)$$



where P(l) is the a priori probability for the l-th structuring region, P(x/l) is the 
conditional probability for a feature vector x, and l takes values over all considered L 
structuring regions. The feature vector x is composed of the residual vector of 
polynomial regression with respect to the current structuring region $V_l(i,j)$ (for inliers 
$\{r_{in}\}$) and the background sub-region $B_l(i,j)$ corresponding to $V_l(i,j)$ (for outliers 
$\{r_{out}\}$). The inlier residuals have a Gaussian distribution $N(0; \sigma)$ whereas the 
distribution of outlier residuals can be approximated, for instance, by a one-sided 
Gaussian distribution $G(A; \sigma)$. For equal {P(l)} and equal size of structuring regions, 
not including the symmetric region, the selection by Eq. (3.2) is reduced to the rule: 

$$M = \arg\min_{l} \Big\{ \sum_{in \in V_l} r_{in}^2 + \sum_{out \in B_l} (|r_{out}| - A)^2 \Big\}. \qquad (3.3)$$

Thus, the complete enumeration of sub-samples in Eq. (3.1) is replaced by the 
consideration of a relatively small number L of structuring regions. Since it is 
supposed that the symmetric structuring region contains no edges, it is excluded from 
the selection in the first step. The second step consists of selecting one final region 
from the two remaining candidates: the selected edge structuring region and the symmetric 
structuring region. This is done by the likelihood ratio rule, assuming the respective 
conditional distributions for inlier and outlier residuals and a significance level. 
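
The first selection step, rule (3.3), can be sketched for the zero-order case (q = 0), where the intensity estimate over a region is simply its mean. The code below is a simplified illustration under assumptions made here: the background sub-region $B_l$ is taken to be the rest of the window outside $V_l$, the outlier range parameter A is supplied by the caller, and the second step (likelihood ratio test against the symmetric region) is omitted.

```python
from typing import Dict, Sequence, Tuple

Offset = Tuple[int, int]


def select_edge_region(window: Dict[Offset, float],
                       regions: Sequence[Sequence[Offset]],
                       outlier_range: float) -> Tuple[int, float]:
    """Return (index, intensity estimate) of the edge structuring region
    that minimises the cost of rule (3.3) with a zero-order model."""
    best_index, best_cost, best_estimate = -1, float("inf"), 0.0
    for index, region in enumerate(regions):
        inside = set(region)
        # Zero-order LS estimate: the mean over the structuring region.
        estimate = sum(window[p] for p in inside) / len(inside)
        cost = 0.0
        for p, value in window.items():
            residual = value - estimate
            if p in inside:
                cost += residual ** 2                        # inlier term
            else:
                cost += (abs(residual) - outlier_range) ** 2  # outlier term
        if cost < best_cost:
            best_index, best_cost, best_estimate = index, cost, estimate
    return best_index, best_estimate


# 3x3 window with a vertical step edge: two dark columns, one bright column.
window = {(i, j): (200.0 if j == 1 else 10.0)
          for i in (-1, 0, 1) for j in (-1, 0, 1)}
left = [(i, j) for i in (-1, 0, 1) for j in (-1, 0)]   # equal-size regions
right = [(i, j) for i in (-1, 0, 1) for j in (0, 1)]

index, estimate = select_edge_region(window, [left, right], outlier_range=190.0)
```

The region lying entirely on the dark side of the edge is selected, and the central pixel is estimated from it alone, which is why the edge is not blurred.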

3.3 Application of Robust Intensity Estimation to Local Binarization 

The goal of binary segmentation is to extract objects of interest from the background 
and assign corresponding labels to them. The known techniques for image 
segmentation by a thresholding operation are mostly based on threshold determination 
from image histograms. In applications with low-contrast objects of interest located 
on a noisy background, these methods fail to segment objects of interest correctly 
because the image histograms are usually not bimodal. 

Fig. 2. Flowchart of structure-adaptive filtering and segmentation. 

Fig. 3. Result of filtering of noise corrupted image AZ by: (b) median filter, (c) LMS 
filter, and (d) proposed structure-adaptive filter with q=0 and window size N=3x3. 

The main idea behind the proposed model-based method of binarization is to 
apply the thresholding only in fragments which satisfy the CPR model condition for 
image fragments in W(u,v). In fact, the discrimination of a two-region fragment is 
made during the stage of selection of a single structuring region out of two regions: 
the symmetric structuring region and the best fitting edge structuring region. No thresholding 
is made when the symmetric structuring region is selected; otherwise the binarization of 
the fragment is performed as an intensity thresholding with a variable threshold. If the 
CPR model is assumed, the threshold surface has the same regression coefficients as 
the intensity function in the selected structuring region except for the intercept 
coefficient. The coefficients are robustly estimated during the selection of structuring 
regions (Section 3.1). The explicit value for the variable threshold is determined 
based on the likelihood ratio rule. Assuming the CPR model, i.e. the inlier (object) 
residuals have the distribution $N(h; \sigma^2)$ and the outlier (background) residuals have 
the distribution $N(0; \sigma^2)$, the thresholding is made by the test: 

$$g(i,j) > f(i,j) + \frac{h}{2} + \frac{\sigma^2}{h} \ln \lambda, \qquad (3.4)$$

where f(i,j) is the reconstructed polynomial function based on the estimated regression 
coefficients, $\lambda$ is the ratio of prior probabilities P(0)/P(1) for background and object 
points, respectively, h is the estimated local contrast as the difference between 
object and background intensities, and $\sigma^2$ is the estimated noise variance. The 
right-hand side of Eq. (3.4) represents a variable (floating) threshold surface if a two-region 
fragment has been detected. 
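
The form of the threshold in Eq. (3.4) follows from comparing the posterior probabilities of the two classes under the stated Gaussian assumptions. The derivation below is a sketch; the symbol r for the residual g(i,j) - f(i,j) is introduced here, and h > 0 is assumed.

```latex
\begin{align*}
P(1)\,\exp\!\Big(-\frac{(r-h)^2}{2\sigma^2}\Big)
  &> P(0)\,\exp\!\Big(-\frac{r^2}{2\sigma^2}\Big)
  && \text{(object more probable than background)} \\
\frac{r^2-(r-h)^2}{2\sigma^2}
  &> \ln\frac{P(0)}{P(1)} = \ln\lambda \\
2rh - h^2 &> 2\sigma^2 \ln\lambda \\
r &> \frac{h}{2} + \frac{\sigma^2}{h}\,\ln\lambda
\end{align*}
```

For equal priors ($\lambda = 1$) the logarithmic term vanishes and the threshold lies halfway between the background and object levels; a larger noise variance or a smaller contrast makes the prior ratio weigh more heavily.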




Fig. 4. RMSE of image AZ corrupted by mixed normal and impulsive noise obtained 
for different robust filtering techniques. 



4 Experimental Results 

The proposed technique for structure-adaptive filtering has been tested on different 
real, composite (natural initial image and artificial noise) and synthetic images, 
including the synthetic image AZ (Fig. 3). The initially bilevel image AZ has been 
chosen for experiments because it contains abrupt edges of various shapes. The results 
of structure-adaptive filtering in comparison with conventional median filtering 
[1] and the adaptive LMS filter based on a modified LMS robust estimate [2,8] are 
shown in Fig. 3 as applied to image AZ. The conventional median filter has been 
selected for comparison because in practice it provides good results at low 
computational expense. For computational reasons, the structuring regions in all 
experiments coincide with respective shifted versions of one structuring element 
consisting of N=3x3 points. Polynomial regression of first order (q=1, i.e. three 
coefficients in model (2.1)) and of zero order (q=0) has been used in 
most experiments. The experimental results have confirmed the theoretical conclusion 
that the proposed filter removes noise well and preserves edges at the same time. The 
results of comparison in terms of the root mean square error (RMSE) of restoration 
with the median filter [1] and the modified LMS filter [2] are shown in Fig. 4. Some 
results of multi-scale binary segmentation using the proposed approach as applied 
to radiographic images are shown in Fig. 5 and Fig. 6. The quality of binary 
segmentation and detection of a two-region fragment has been evaluated in terms of 
correct binary segmentation with respect to the signal-to-noise ratio. 





Fig. 5. Results of object extraction by the binary segmentation of medical radiographic images. 




Fig. 6. Local binarization of a radiographic weld image for detection of weld defects. 
5 Conclusion 



The proposed algorithm for robust intensity estimation using MAP sub-sample 
selection has been presented and tested on image noise filtering and 
binarization problems. It is a model-based approach using the concept of multiple 
structuring regions as a principle of adaptive estimation of a pixel value corrupted by 
noise. 

The algorithm of structure-adaptive filtering removes noise well and does not blur 
edges and small isolated objects. Application of the robust parameter estimation to 
optimal threshold determination allows reliable binary segmentation of 
low-contrast and noisy image fragments. This is confirmed by experimental results on 
radiographic images from industrial non-destructive testing and medical diagnostic 
imaging. 